Abstract. This paper derives exponential concentration inequalities and polynomial moment inequalities
for the spectral norm of a random matrix. The analysis requires a matrix extension of
the scalar concentration theory developed by Sourav Chatterjee using Stein's method of exchangeable
pairs. When applied to a sum of independent random matrices, this approach yields matrix
generalizations  of  the  classical  inequalities  due  to  Hoeffding,  Bernstein,  Khintchine,  and  Rosenthal. 
The  same  technique  delivers  bounds  for  sums  of  dependent  random  matrices  and  more  general 
matrix-valued  functions  of  dependent  random  variables. 


This  paper  is  based  on  two  independent  manuscripts  from  mid-2011  that  both  applied  the  method 
of  exchangeable  pairs  to  establish  matrix  concentration  inequalities.  One  manuscript  is  by  Mackey 
and  Jordan;  the  other  is  by  Chen,  Farrell,  and  Tropp.  The  authors  have  combined  this  research 
into  a  single  unified  presentation,  with  equal  contributions  from  both  groups. 

1.  Introduction 

Matrix  concentration  inequalities  control  the  fluctuations  of  a  random  matrix  about  its  mean. 
At  present,  these  results  provide  an  effective  method  for  studying  sums  of  independent  random 
matrices and matrix martingales [Oli09, Tro11a, Tro11b, HKZ11]. They have been used to streamline
the analysis of structured random matrices in a range of applications, including statistical
estimation [Kol11], randomized linear algebra [Git11, CD11b], stability of least-squares approximation
[CDL11], combinatorial and robust optimization [So11, CSW11], matrix completion [Gro11,
Rec11, MTJ11], and random graph theory [Oli09]. These works comprise only a small sample of
the  papers  that  rely  on  matrix  concentration  inequalities.  Nevertheless,  it  remains  common  to 
encounter  new  classes  of  random  matrices  that  we  cannot  treat  with  the  available  techniques. 

The  purpose  of  this  paper  is  to  lay  the  foundations  of  a  new  approach  for  analyzing  structured 
random  matrices.  Our  work  is  based  on  Chatterjee’s  technique  for  developing  scalar  concentration 
inequalities [Cha07, Cha08] via Stein's method of exchangeable pairs [Ste72]. We extend this
argument  to  the  matrix  setting,  where  we  use  it  to  establish  exponential  concentration  results 
(Theorems  4.1  and  5.1)  and  polynomial  moment  inequalities  (Theorem  7.1)  for  the  spectral  norm 
of  a  random  matrix. 

To illustrate the power of this idea, we show that our general results imply several important concentration
bounds for a sum of independent, random, Hermitian matrices [LPP91, JX03, Tro11b].
We  obtain  a  matrix  Hoeffding  inequality  with  optimal  constants  (Corollary  4.2)  and  a  version  of 
the  matrix  Bernstein  inequality  (Corollary  5.2).  Our  techniques  also  yield  concise  proofs  of  the 
matrix  Khintchine  inequality  (Corollary  7.4)  and  the  matrix  Rosenthal  inequality  (Corollary  7.5). 

The  method  of  exchangeable  pairs  also  applies  to  matrices  constructed  from  dependent  random 
variables.  We  offer  a  hint  of  the  prospects  by  establishing  concentration  results  for  several  other 
classes of random matrices. In Section 9, we consider sums of dependent matrices that satisfy
a conditional zero-mean property. In Section 10, we treat a broad class of combinatorial matrix
statistics. Finally, in Section 11, we analyze general matrix-valued functions that have a self-reproducing
property.


1.1. Notation and Preliminaries. The symbol $\|\cdot\|$ is reserved for the spectral norm, which
returns the largest singular value of a general complex matrix.

We write $\mathbb{M}^d$ for the algebra of all $d \times d$ complex matrices. The trace and normalized trace of a
square matrix are defined as

$$\operatorname{tr} B := \sum_{j=1}^{d} b_{jj} \quad\text{and}\quad \bar{\operatorname{tr}}\, B := \frac{1}{d} \sum_{j=1}^{d} b_{jj} \quad\text{for } B \in \mathbb{M}^d.$$

We define the linear space $\mathbb{H}^d$ of Hermitian $d \times d$ matrices. All matrices in this paper are Hermitian
unless explicitly stated otherwise. The symbols $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ refer to the algebraic
maximum and minimum eigenvalues of a matrix $A \in \mathbb{H}^d$. For each interval $I \subset \mathbb{R}$, we define the
set of Hermitian matrices whose eigenvalues fall in that interval:

$$\mathbb{H}^d(I) := \{A \in \mathbb{H}^d : \lambda_{\max}(A), \lambda_{\min}(A) \in I\}.$$

The set $\mathbb{H}^d_+$ consists of all positive-semidefinite (psd) $d \times d$ matrices. Curly inequalities refer to the
semidefinite partial order on Hermitian matrices. For example, we write $A \preccurlyeq B$ to signify that the
matrix $B - A$ is psd.

We  require  operator  convexity  properties  of  the  matrix  square  so  often  that  we  state  them  now. 


$$\left(\frac{A + B}{2}\right)^2 \preccurlyeq \frac{A^2 + B^2}{2} \quad\text{for all } A, B \in \mathbb{H}^d. \tag{1.1}$$


More  generally,  we  have  the  operator  Jensen  inequality 

$$(\mathbb{E} X)^2 \preccurlyeq \mathbb{E} X^2, \tag{1.2}$$

valid for any random Hermitian matrix $X$, provided that $\mathbb{E}\|X\|^2 < \infty$. To verify this result, simply
expand the inequality $\mathbb{E}(X - \mathbb{E} X)^2 \succcurlyeq 0$. The operator Jensen inequality also holds for conditional
expectation, again provided that $\mathbb{E}\|X\|^2 < \infty$.


2.  Exchangeable  Pairs  of  Random  Matrices 

Our  approach  to  studying  random  matrices  is  based  on  the  method  of  exchangeable  pairs,  which 
originates  in  the  work  of  Charles  Stein  [Ste72]  on  normal  approximation  for  a  sum  of  dependent 
random  variables.  In  this  section,  we  explain  how  some  central  ideas  from  this  theory  extend  to 
matrices. 

2.1.  Matrix  Stein  Pairs.  We  begin  with  the  definition  of  an  exchangeable  pair. 

Definition 2.1 (Exchangeable Pair). Let $Z$ and $Z'$ be random variables taking values in a Polish
space¹ $\mathcal{Z}$. We say that $(Z, Z')$ is an exchangeable pair if it has the same distribution as $(Z', Z)$. In
particular, $Z$ and $Z'$ must share the same distribution.

We  can  obtain  a  lot  of  information  about  the  fluctuations  of  a  random  matrix  X  if  we  can 
construct  a  good  exchangeable  pair  (X,X').  With  this  motivation  in  mind,  let  us  introduce  a 
special  class  of  exchangeable  pairs. 


¹ A topological space is Polish if we can equip it with a metric to form a complete, separable metric space.

Definition 2.2 (Matrix Stein Pair). Let $(Z, Z')$ be an exchangeable pair of random variables taking
values in a Polish space $\mathcal{Z}$, and let $\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function. Define the random
Hermitian matrices

$$X := \Psi(Z) \quad\text{and}\quad X' := \Psi(Z').$$

We say that $(X, X')$ is a matrix Stein pair if there is a constant $\alpha \in (0, 1]$ for which

$$\mathbb{E}[X - X' \mid Z] = \alpha X \quad\text{almost surely.} \tag{2.1}$$

The constant $\alpha$ is called the scale factor of the pair. When discussing a matrix Stein pair $(X, X')$,
we always assume that $\mathbb{E}\|X\| < \infty$.

A matrix Stein pair $(X, X')$ has several useful properties. First, $(X, X')$ always forms an
exchangeable pair. Second, it must be the case that $\mathbb{E} X = 0$. Indeed,

$$\mathbb{E} X = \frac{1}{\alpha}\, \mathbb{E}\big[\mathbb{E}[X - X' \mid Z]\big] = \frac{1}{\alpha}\, \mathbb{E}[X - X'] = 0$$

because of the identity (2.1), the tower property of conditional expectation, and the exchangeability
of $(X, X')$. In Section 2.4, we construct a matrix Stein pair for a sum of centered, independent
random matrices. More sophisticated examples appear in Sections 9, 10, and 11.


2.2.  The  Method  of  Exchangeable  Pairs.  A  well-chosen  matrix  Stein  pair  (X,X')  provides  a 
surprisingly  powerful  tool  for  studying  the  random  matrix  X.  The  technique  depends  on  a  simple 
but  fundamental  technical  lemma. 

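Lemma 2.3 (Method of Exchangeable Pairs). Suppose that $(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$ is a matrix Stein
pair with scale factor $\alpha$. Let $F : \mathbb{H}^d \to \mathbb{H}^d$ be a measurable function that satisfies the regularity
condition

$$\mathbb{E}\,\|(X - X') \cdot F(X)\| < \infty. \tag{2.2}$$

Then

$$\mathbb{E}[X \cdot F(X)] = \frac{1}{2\alpha}\, \mathbb{E}\big[(X - X')(F(X) - F(X'))\big]. \tag{2.3}$$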

To  appreciate  this  result,  recall  that  we  can  characterize  the  distribution  of  a  random  matrix  by 
integrating  it  against  a  sufficiently  large  class  of  test  functions.  The  additional  randomness  in  the 
Stein  pair  furnishes  an  alternative  expression  for  the  expected  product  of  X  and  the  test  function 
F.  The  identity  (2.3)  is  valuable  because  it  allows  us  to  estimate  this  integral  using  the  smoothness 
properties of the function $F$ and the discrepancy between $X$ and $X'$.

Proof. Suppose $(X, X')$ is a matrix Stein pair constructed from an auxiliary exchangeable pair
$(Z, Z')$. The defining property (2.1) of the Stein pair implies

$$\alpha \cdot \mathbb{E}[X \cdot F(X)] = \mathbb{E}\big[\mathbb{E}[X - X' \mid Z] \cdot F(X)\big] = \mathbb{E}[(X - X')\, F(X)].$$

We have used the regularity condition (2.2) to invoke the pull-through property of conditional
expectation. Since $(X, X')$ is an exchangeable pair,

$$\alpha \cdot \mathbb{E}[X \cdot F(X)] = \mathbb{E}[(X - X')\, F(X)] = \mathbb{E}[(X' - X)\, F(X')] = -\mathbb{E}[(X - X')\, F(X')].$$

The identity (2.3) follows when we average the two preceding displays. □
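The identity (2.3) can be checked exactly on a small example. The following sketch is an illustration under stated assumptions, not part of the original argument: it takes a Rademacher series $X = \sum_k \varepsilon_k A_k$ with hypothetical Hermitian coefficients $A_k$ drawn once from a seeded generator, builds $X'$ by resampling one uniformly chosen sign (the construction described in Section 2.4, with scale factor $\alpha = 1/n$), and enumerates every outcome so that both expectations are computed without sampling error. The test function $F(X) = e^{\theta X}$ and all names (`herm_fun`, `coeffs`, `max_gap`) are choices made for this sketch.

```python
import itertools
import numpy as np

def herm_fun(f, A):
    """Apply the scalar function f to the eigenvalues of the Hermitian matrix A."""
    lam, Q = np.linalg.eigh(A)
    return (Q * f(lam)) @ Q.conj().T

rng = np.random.default_rng(0)
n, d, theta = 3, 2, 0.7
alpha = 1.0 / n  # scale factor of the resampling pair

# Hypothetical fixed Hermitian coefficients A_k; the summands are eps_k * A_k.
coeffs = []
for _ in range(n):
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    coeffs.append((G + G.conj().T) / 2)

def F(X):
    """Test function F(X) = exp(theta * X)."""
    return herm_fun(lambda lam: np.exp(theta * lam), X)

lhs = np.zeros((d, d), dtype=complex)  # E[X F(X)]
rhs = np.zeros((d, d), dtype=complex)  # (1 / (2 alpha)) E[(X - X')(F(X) - F(X'))]
for eps in itertools.product([-1, 1], repeat=n):
    p_z = 0.5 ** n                      # probability of this sign pattern
    X = sum(e * A for e, A in zip(eps, coeffs))
    lhs += p_z * (X @ F(X))
    for k in range(n):                  # resampled index K, uniform over the n summands
        for fresh in (-1, 1):           # independent copy of eps_k
            p = p_z * (1.0 / n) * 0.5
            X_prime = X + (fresh - eps[k]) * coeffs[k]
            rhs += p * ((X - X_prime) @ (F(X) - F(X_prime)))
rhs /= 2 * alpha

max_gap = np.max(np.abs(lhs - rhs))     # should vanish up to rounding error
```

Exhaustive enumeration is used instead of Monte Carlo so that any discrepancy would indicate an error in the construction, not sampling noise.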
2.3. The Conditional Variance. To each matrix Stein pair $(X, X')$, we may associate a random
matrix called the conditional variance of $X$. The ultimate purpose of this paper is to argue that
the spectral norm of $X$ is unlikely to be large when the conditional variance is small.

Definition 2.4 (Conditional Variance). Suppose that $(X, X')$ is a matrix Stein pair, constructed
from an auxiliary exchangeable pair $(Z, Z')$. The conditional variance is the random matrix

$$\Delta_X := \Delta_X(Z) := \frac{1}{2\alpha}\, \mathbb{E}\big[(X - X')^2 \mid Z\big], \tag{2.4}$$

where $\alpha$ is the scale factor of the pair. We may take any version of the conditional expectation in
this definition.

The conditional variance $\Delta_X$ should be regarded as a stochastic estimate for the variance of the
random matrix $X$. Indeed,

$$\mathbb{E}[\Delta_X] = \mathbb{E} X^2. \tag{2.5}$$

This identity follows immediately from Lemma 2.3 with the choice $F(X) = X$.

2.4.  Example:  A  Sum  of  Independent  Random  Matrices.  To  make  the  definitions  in  this 
section  more  vivid,  we  describe  a  simple  but  important  example  of  a  matrix  Stein  pair.  Consider 
an independent sequence $Z := (Y_1, \dots, Y_n)$ of random Hermitian matrices that satisfy $\mathbb{E} Y_k = 0$
and $\mathbb{E}\|Y_k\| < \infty$ for each $k$. Introduce the random series

$$X := Y_1 + \dots + Y_n.$$

Let us explain how to build a good matrix Stein pair $(X, X')$. We need the exchangeable
counterpart $X'$ to have the same distribution as $X$, but it should also be close to $X$ so that we can
control the conditional variance. To achieve these goals, we construct $X'$ by picking a summand
from $X$ at random and replacing it with a fresh copy.

Formally, let $Y_k'$ be an independent copy of $Y_k$ for each index $k$, and draw a random index
$K$ uniformly at random from $\{1, \dots, n\}$ and independently from everything else. Define the random
sequence $Z' := (Y_1, \dots, Y_{K-1}, Y_K', Y_{K+1}, \dots, Y_n)$. It is easy to check that $(Z, Z')$ forms an
exchangeable pair. The random matrix

$$X' := Y_1 + \dots + Y_{K-1} + Y_K' + Y_{K+1} + \dots + Y_n$$

is thus an exchangeable counterpart for $X$. To verify that $(X, X')$ is a Stein pair, calculate that

$$\mathbb{E}[X - X' \mid Z] = \mathbb{E}[Y_K - Y_K' \mid Z] = \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[Y_k - Y_k' \mid Z] = \frac{1}{n} \sum_{k=1}^{n} Y_k = \frac{1}{n} X.$$

The third identity holds because $Y_k'$ is a centered random matrix that is independent from $Z$.
Therefore, $(X, X')$ is a matrix Stein pair with scale factor $\alpha = n^{-1}$.

Next, we compute the conditional variance.

$$\begin{aligned}
\Delta_X &= \frac{n}{2} \cdot \mathbb{E}\big[(X - X')^2 \mid Z\big] \\
&= \frac{n}{2} \cdot \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}\big[(Y_k - Y_k')^2 \mid Z\big] \\
&= \frac{1}{2} \sum_{k=1}^{n} \big[Y_k^2 - Y_k (\mathbb{E} Y_k') - (\mathbb{E} Y_k') Y_k + \mathbb{E} (Y_k')^2\big] \\
&= \frac{1}{2} \sum_{k=1}^{n} \big(Y_k^2 + \mathbb{E} Y_k^2\big).
\end{aligned} \tag{2.6}$$

For the third relation, expand the square and invoke the pull-through property of conditional
expectation. We may drop the conditioning because $Y_k'$ is independent from $Z$. In the last line, we
apply the property that $Y_k'$ has the same distribution as $Y_k$.
The expression (2.6) shows that we can control the size of the conditional variance uniformly
if we can control the size of the individual summands. This example also teaches us that we may
use the symmetries of the distribution of the random matrix to construct a matrix Stein pair.
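The two computations in this example, the scale factor $\alpha = 1/n$ and the formula (2.6), can be reproduced by brute force. The sketch below is a hedged illustration: it assumes summands of the form $Y_k = \xi_k A_k$, where the $A_k$ are hypothetical symmetric coefficient matrices and $\xi_k$ follows a centered two-point law chosen only for this example ($-1$ with probability $2/3$, $2$ with probability $1/3$). It enumerates the random index $K$ and the fresh copy to obtain $\mathbb{E}[X - X' \mid Z]$ and $\Delta_X$ for every configuration $Z$, then compares them with $\alpha X$ and with (2.6). It also checks the identity (2.5).

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 2
alpha = 1.0 / n

# Hypothetical centered scalar law: xi = -1 w.p. 2/3 and xi = 2 w.p. 1/3.
values = np.array([-1.0, 2.0])
probs = np.array([2.0 / 3.0, 1.0 / 3.0])
second_moment = float(probs @ values**2)   # E xi^2

# Hypothetical symmetric coefficients A_k; the summands are Y_k = xi_k * A_k.
coeffs = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    coeffs.append((G + G.T) / 2)

def conditional_moments(xi):
    """Enumerate K and the fresh copy to get E[X - X' | Z] and Delta_X at Z = xi."""
    drift = np.zeros((d, d))
    second = np.zeros((d, d))
    for k in range(n):
        for x_new, p_new in zip(values, probs):
            D = (xi[k] - x_new) * coeffs[k]   # X - X' when K = k
            drift += (p_new / n) * D
            second += (p_new / n) * (D @ D)
    return drift, second / (2 * alpha)

stein_gap = 0.0      # largest deviation from E[X - X' | Z] = alpha X
formula_gap = 0.0    # largest deviation from the closed form (2.6)
mean_delta = np.zeros((d, d))
mean_square = np.zeros((d, d))
for idx in itertools.product(range(len(values)), repeat=n):
    idx = list(idx)
    xi, p_z = values[idx], np.prod(probs[idx])
    X = sum(x * A for x, A in zip(xi, coeffs))
    drift, delta = conditional_moments(xi)
    closed_form = 0.5 * sum((x**2 + second_moment) * (A @ A) for x, A in zip(xi, coeffs))
    stein_gap = max(stein_gap, np.max(np.abs(drift - alpha * X)))
    formula_gap = max(formula_gap, np.max(np.abs(delta - closed_form)))
    mean_delta += p_z * delta
    mean_square += p_z * (X @ X)
```

Because $\xi_k^2$ is not constant under this law, $\Delta_X$ genuinely depends on $Z$ here, which makes the comparison with (2.6) more informative than it would be for Rademacher signs.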

3.  Exponential  Moments  and  Eigenvalues  of  a  Random  Matrix 

Our  main  goal  in  this  paper  is  to  study  the  behavior  of  the  extreme  eigenvalues  of  a  random 
Hermitian  matrix.  In  Section  3.2,  we  describe  an  approach  to  this  problem  that  parallels  the 
classical  Laplace  transform  method  for  scalar  random  variables.  The  adaptation  to  the  matrix 
setting  leads  us  to  consider  the  trace  of  the  moment  generating  function  (mgf)  of  a  random  matrix. 
After  presenting  this  background,  we  explain  how  the  method  of  exchangeable  pairs  can  be  used 
to  control  the  growth  of  the  trace  mgf.  This  result,  which  appears  in  Section  3.5,  is  the  key  to  our 
exponential  concentration  bounds  for  random  matrices. 

3.1.  Standard  Matrix  Functions.  Before  entering  the  discussion,  recall  that  a  standard  matrix 
function  is  obtained  by  applying  a  real  function  to  the  eigenvalues  of  a  Hermitian  matrix.  The 
book  [Hig08]  provides  an  excellent  treatment  of  this  concept. 

Definition 3.1 (Standard Matrix Function). Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of the
real line. Suppose that $A \in \mathbb{H}^d(I)$ has the eigenvalue decomposition $A = Q \cdot \operatorname{diag}(\lambda_1, \dots, \lambda_d) \cdot Q^*$
where $Q$ is a unitary matrix. Then

$$f(A) := Q \cdot \begin{bmatrix} f(\lambda_1) & & \\ & \ddots & \\ & & f(\lambda_d) \end{bmatrix} \cdot Q^*.$$

The spectral mapping theorem states that $f(\lambda)$ is an eigenvalue of $f(A)$ if and only if $\lambda$ is an
eigenvalue of $A$. This fact follows immediately from Definition 3.1.

When  we  apply  a  familiar  scalar  function  to  a  Hermitian  matrix,  we  are  always  referring  to  a 
standard  matrix  function.  For  instance,  |A|  is  the  matrix  absolute  value,  exp(A)  is  the  matrix 
exponential,  and  log(A)  is  the  matrix  logarithm.  The  latter  is  defined  only  for  positive  matrices. 
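Definition 3.1 translates directly into a few lines of numerical code. The sketch below is one possible rendering, assuming NumPy's `numpy.linalg.eigh` for the eigenvalue decomposition; the function name `standard_matrix_function` and the test matrix are choices made for this illustration.

```python
import numpy as np

def standard_matrix_function(f, A):
    """Return f(A) = Q diag(f(lambda_1), ..., f(lambda_d)) Q^* for a Hermitian matrix A."""
    lam, Q = np.linalg.eigh(A)          # A = Q diag(lam) Q^*
    return (Q * f(lam)) @ Q.conj().T    # scale column j of Q by f(lam[j])

A = np.array([[2.0, 1.0], [1.0, 2.0]])                        # eigenvalues 1 and 3
exp_A = standard_matrix_function(np.exp, A)                   # matrix exponential
log_A = standard_matrix_function(np.log, A)                   # defined because A is positive
abs_B = standard_matrix_function(np.abs, A - 2 * np.eye(2))   # eigenvalues -1 and 1
```

In this example the eigenvalues of `exp_A` are $e^{1}$ and $e^{3}$, as the spectral mapping theorem predicts, and `abs_B` is the identity matrix because both eigenvalues of $A - 2I$ have modulus one.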

3.2.  The  Matrix  Laplace  Transform  Method.  Let  us  introduce  a  matrix  variant  of  the  classical 
moment  generating  function.  We  learned  this  definition  from  Ahlswede-Winter  [AW02,  App.]. 

Definition 3.2 (Trace Mgf). Let $X$ be a random Hermitian matrix. The (normalized) trace
moment generating function of $X$ is defined as

$$m(\theta) := m_X(\theta) := \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\theta X} \quad\text{for } \theta \in \mathbb{R}.$$

We admit the possibility that the expectation may not exist for all values of $\theta$.

Ahlswede  and  Winter  [AW02,  App.]  had  the  insight  that  the  classical  Laplace  transform  method 
could  be  extended  to  the  matrix  setting  by  replacing  the  classical  mgf  with  the  trace  mgf.  This 
adaptation  allows  us  to  obtain  concentration  inequalities  for  the  extreme  eigenvalues  of  a  random 
Hermitian  matrix  using  methods  from  matrix  analysis.  The  following  proposition  distills  results 
from the papers [AW02, Oli10, Tro11b, CGT12].

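Proposition 3.3 (Matrix Laplace Transform Method). Let $X \in \mathbb{H}^d$ be a random matrix with
normalized trace mgf $m(\theta) := \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\theta X}$. For each $t \in \mathbb{R}$,

\begin{align}
\mathbb{P}\{\lambda_{\max}(X) \ge t\} &\le d \cdot \inf_{\theta > 0} \exp\{-\theta t + \log m(\theta)\}, \tag{3.1} \\
\mathbb{P}\{\lambda_{\min}(X) \le t\} &\le d \cdot \inf_{\theta < 0} \exp\{-\theta t + \log m(\theta)\}. \tag{3.2}
\end{align}

Furthermore,

\begin{align}
\mathbb{E}\,\lambda_{\max}(X) &\le \inf_{\theta > 0} \frac{1}{\theta}\, [\log d + \log m(\theta)], \tag{3.3} \\
\mathbb{E}\,\lambda_{\min}(X) &\ge \sup_{\theta < 0} \frac{1}{\theta}\, [\log d + \log m(\theta)]. \tag{3.4}
\end{align}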

The  estimates  (3.3)  and  (3.4)  for  the  expectations  are  usually  sharp  up  to  the  logarithm  of  the 
dimension.  In  many  situations,  the  tail  bounds  (3.1)  and  (3.2)  are  reasonable  for  moderate  t,  but 
they  tend  to  overestimate  the  probability  of  a  large  deviation.  Note  that,  in  general,  we  cannot 
dispense with the dimensional factor $d$. See [Tro11b, Sec. 4] for a detailed discussion of these issues.
Additional  inequalities  for  the  interior  eigenvalues  can  be  established  using  the  minimax  Laplace 
transform  method  [GT11]. 

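Proof. To establish (3.1), fix $\theta > 0$. Using Markov's inequality, we find that

$$\begin{aligned}
\mathbb{P}\{\lambda_{\max}(X) \ge t\} &= \mathbb{P}\{e^{\lambda_{\max}(\theta X)} \ge e^{\theta t}\} \le e^{-\theta t} \cdot \mathbb{E}\, e^{\lambda_{\max}(\theta X)} \\
&= e^{-\theta t} \cdot \mathbb{E}\,\lambda_{\max}(e^{\theta X}) \le e^{-\theta t} \cdot \mathbb{E} \operatorname{tr} e^{\theta X}.
\end{aligned}$$

The third relation follows from the spectral mapping theorem. The final inequality holds because
the trace of a positive matrix dominates its maximum eigenvalue. Identify the normalized trace
mgf, and take the infimum over $\theta$ to complete the argument.

The proof of (3.2) parallels the proof of (3.1). For $\theta < 0$,

$$\mathbb{P}\{\lambda_{\min}(X) \le t\} = \mathbb{P}\{\theta \lambda_{\min}(X) \ge \theta t\} = \mathbb{P}\{\lambda_{\max}(\theta X) \ge \theta t\}.$$

We have used the property that $-\lambda_{\min}(A) = \lambda_{\max}(-A)$ for each Hermitian matrix $A$. The remainder
of the argument is the same as in the preceding paragraph.

For the expectation bound (3.3), fix $\theta > 0$. Jensen's inequality yields

$$\mathbb{E}\,\lambda_{\max}(X) = \theta^{-1}\, \mathbb{E}\,\lambda_{\max}(\theta X) \le \theta^{-1} \log \mathbb{E}\, e^{\lambda_{\max}(\theta X)} \le \theta^{-1} \log \mathbb{E} \operatorname{tr} e^{\theta X}.$$

The justification is the same as above. Identify the normalized trace mgf, and take the infimum
over $\theta > 0$. Similar considerations yield (3.4). □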
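To see the bounds (3.1) and (3.3) at work, one can compare them with exact quantities on a model small enough to enumerate. The sketch below assumes a Rademacher series $X = \sum_k \varepsilon_k A_k$ with hypothetical symmetric coefficients $A_k$; this model, the grid of $\theta$ values that stands in for the infimum, and every function name are assumptions of the illustration. Since the inequalities hold for each fixed $\theta > 0$, restricting the infimum to a grid still produces valid (if slightly weaker) bounds.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3

# Hypothetical symmetric coefficients A_k for the series X = sum_k eps_k A_k.
coeffs = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    coeffs.append((G + G.T) / 2)

# Eigenvalues (ascending) of X for each of the 2^n equally likely sign patterns.
spectra = np.array([
    np.linalg.eigvalsh(sum(e * A for e, A in zip(eps, coeffs)))
    for eps in itertools.product([-1, 1], repeat=n)
])

def trace_mgf(theta):
    """Exact normalized trace mgf: average of exp(theta * eigenvalue) over outcomes and eigenvalues."""
    return np.mean(np.exp(theta * spectra))

thetas = np.linspace(0.01, 3.0, 300)   # grid standing in for the infimum over theta > 0

def laplace_tail_bound(t):
    """Right-hand side of (3.1), with the infimum restricted to the grid."""
    return d * min(np.exp(-th * t) * trace_mgf(th) for th in thetas)

def exact_tail(t):
    """Exact probability that the maximum eigenvalue of X is at least t."""
    return np.mean(spectra[:, -1] >= t)

def laplace_mean_bound():
    """Right-hand side of (3.3), with the infimum restricted to the grid."""
    return min((np.log(d) + np.log(trace_mgf(th))) / th for th in thetas)

exact_mean = np.mean(spectra[:, -1])
```

Calling `exact_tail(t)` and `laplace_tail_bound(t)` for a range of `t` shows how much the Laplace transform bound gives away at each level in this particular model.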

3.3.  Studying  the  Trace  Mgf  with  Exchangeable  Pairs.  The  technical  difficulty  in  the  matrix 
Laplace  transform  method  arises  because  we  need  to  estimate  the  trace  mgf.  Previous  authors 
have applied deep results from matrix analysis to accomplish this bound: the Golden-Thompson
inequality is central to [AW02, Oli09, Oli10], while Lieb's theorem [Lie73, Thm. 6] animates [Tro11a,
Tro11b, HKZ11].

In  this  paper,  we  develop  a  fundamentally  different  technique  for  studying  the  trace  mgf.  The 
main  idea  is  to  control  the  growth  of  the  trace  mgf  by  bounding  its  derivative.  To  see  why  we  have 
adopted  this  strategy,  consider  a  random  Hermitian  matrix  X,  and  observe  that  the  derivative  of 
its  trace  mgf  can  be  written  as 

$$m'(\theta) = \mathbb{E}\,\bar{\operatorname{tr}}\,\big[X e^{\theta X}\big]$$

under appropriate regularity conditions. This expression has just the form that we need to invoke
the method of exchangeable pairs, Lemma 2.3, with $F(X) = e^{\theta X}$. We obtain

$$m'(\theta) = \frac{1}{2\alpha}\, \mathbb{E}\,\bar{\operatorname{tr}}\,\big[(X - X')(e^{\theta X} - e^{\theta X'})\big]. \tag{3.5}$$

This formula strongly suggests that we should apply a mean value theorem to control the derivative;
we establish the result that we need in Section 3.4 below. Ultimately, this argument leads to a
differential inequality for $m'(\theta)$, which we can integrate to obtain an estimate for $m(\theta)$.
The technique of bounding the derivative of an mgf lies at the heart of the log-Sobolev method
for studying concentration phenomena [Led01, Ch. 5]. Recently, Chatterjee [Cha07, Cha08] demonstrated
that the method of exchangeable pairs provides another way to control the derivative of
an  mgf.  Our  arguments  closely  follow  the  pattern  set  by  Chatterjee;  the  novelty  inheres  in  the 
extension  of  these  ideas  to  the  matrix  setting  and  the  striking  applications  that  this  extension 
permits. 


3.4.  The  Mean  Value  Trace  Inequality.  To  bound  the  expression  (3.5)  for  the  derivative  of 
the  trace  mgf,  we  need  a  matrix  generalization  of  the  mean  value  theorem  for  a  function  whose 
derivative  is  convex.  We  state  the  result  in  full  generality  because  it  plays  a  role  later. 

Lemma 3.4 (Mean Value Trace Inequality). Let $I$ be an interval of the real line. Suppose that
$g : I \to \mathbb{R}$ is a weakly increasing function and that $h : I \to \mathbb{R}$ is a function whose derivative $h'$ is
convex. For all matrices $A, B \in \mathbb{H}^d(I)$, it holds that

$$\bar{\operatorname{tr}}\,\big[(g(A) - g(B)) \cdot (h(A) - h(B))\big] \le \frac{1}{2}\, \bar{\operatorname{tr}}\,\big[(g(A) - g(B)) \cdot (A - B) \cdot (h'(A) + h'(B))\big].$$

When $h'$ is concave, the inequality is reversed. The same results hold for the standard trace.

To  prove  Lemma  3.4,  we  need  a  standard  trace  inequality  [Pet94,  Prop.  3],  which  is  an  easy 
consequence  of  the  spectral  theorem  for  Hermitian  matrices. 


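Proposition 3.5 (Generalized Klein Inequality). Let $u_1, \dots, u_n$ and $v_1, \dots, v_n$ be real-valued functions
on an interval $I$ of the real line. Suppose that

$$\sum\nolimits_k u_k(a)\, v_k(b) \ge 0 \quad\text{for all } a, b \in I. \tag{3.6}$$

Then

$$\bar{\operatorname{tr}}\,\Big[\sum\nolimits_k u_k(A)\, v_k(B)\Big] \ge 0 \quad\text{for all } A, B \in \mathbb{H}^d(I).$$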


With  the  generalized  Klein  inequality  at  hand,  we  can  establish  Lemma  3.4  by  developing  the 
appropriate  scalar  inequality. 


Proof of Lemma 3.4. Fix $a, b \in I$. Since $g$ is weakly increasing, $(g(a) - g(b)) \cdot (a - b) \ge 0$. The
fundamental theorem of calculus and the convexity of $h'$ yield the estimate

$$\begin{aligned}
(g(a) - g(b)) \cdot (h(a) - h(b)) &= (g(a) - g(b)) \cdot (a - b) \int_0^1 h'(\tau a + (1 - \tau) b)\, d\tau \\
&\le (g(a) - g(b)) \cdot (a - b) \int_0^1 \big[\tau \cdot h'(a) + (1 - \tau) \cdot h'(b)\big]\, d\tau \\
&= \frac{1}{2} \big[(g(a) - g(b)) \cdot (a - b) \cdot (h'(a) + h'(b))\big].
\end{aligned} \tag{3.7}$$

The inequality is reversed when $h'$ is concave.

The bound (3.7) can be written in the form (3.6) by expanding the products and collecting terms
depending on $a$ into functions $u_k(a)$ and terms depending on $b$ into functions $v_k(b)$. Proposition 3.5
then delivers a trace inequality, which can be massaged into the desired form using the cyclicity of
the trace and the fact that standard functions of the same matrix commute. We omit the algebraic
details. □
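Since the algebraic details are omitted, a numerical sanity check of Lemma 3.4 may be reassuring. The sketch below covers only the special case used later in the paper, $g(s) = s$ and $h(s) = e^{\theta s}$, so that $h'(s) = \theta e^{\theta s}$ is convex for $\theta > 0$ and concave for $\theta < 0$. The random Hermitian test matrices, the seed, and the helper names are assumptions of this illustration; passing the check on random inputs is evidence, not a proof.

```python
import numpy as np

def herm_fun(f, A):
    """Apply the scalar function f to the eigenvalues of the Hermitian matrix A."""
    lam, Q = np.linalg.eigh(A)
    return (Q * f(lam)) @ Q.conj().T

def random_hermitian(rng, d):
    """Hypothetical test matrix: Hermitian part of a complex Gaussian matrix."""
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (G + G.conj().T) / 2

def mean_value_sides(A, B, theta):
    """Both sides of Lemma 3.4 for g(s) = s and h(s) = exp(theta * s), with the normalized trace."""
    d = A.shape[0]
    eA = herm_fun(lambda lam: np.exp(theta * lam), A)
    eB = herm_fun(lambda lam: np.exp(theta * lam), B)
    D = A - B
    lhs = np.trace(D @ (eA - eB)).real / d
    # h'(A) + h'(B) = theta * (exp(theta A) + exp(theta B))
    rhs = 0.5 * theta * np.trace(D @ D @ (eA + eB)).real / d
    return lhs, rhs

rng = np.random.default_rng(3)
pairs = [(random_hermitian(rng, 4), random_hermitian(rng, 4)) for _ in range(200)]
convex_case = [mean_value_sides(A, B, 0.5) for A, B in pairs]     # expect lhs <= rhs
concave_case = [mean_value_sides(A, B, -0.5) for A, B in pairs]   # expect lhs >= rhs
```

The direction of the inequality flips with the sign of $\theta$, which is exactly the dichotomy that produces the upper and lower bounds on $m'(\theta)$ in Section 3.5.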


Remark  3.6.  We  must  warn  the  reader  that  the  proof  of  Lemma  3.4  succeeds  because  the  trace 
contains  a  product  of  three  terms  involving  two  matrices.  The  obstacle  to  proving  more  general 
results is that we cannot reorganize expressions like $\operatorname{tr}(ABAB)$ and $\operatorname{tr}(ABC)$ at will.
3.5. Bounding the Derivative of the Trace Mgf. The central result in this section applies the
method of exchangeable pairs and the mean value trace inequality to bound the derivative of
the trace mgf in terms of the conditional variance. This is the most important step in our theory
on the exponential concentration of random matrices.

Lemma 3.7 (The Derivative of the Trace Mgf). Suppose that $(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$ is a matrix
Stein pair, and assume that $X$ is almost surely bounded in norm. Define the normalized trace mgf
$m(\theta) := \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\theta X}$. Then

\begin{align}
m'(\theta) &\le \theta \cdot \mathbb{E}\,\bar{\operatorname{tr}}\,\big[\Delta_X\, e^{\theta X}\big] \quad\text{when } \theta \ge 0, \tag{3.8} \\
m'(\theta) &\ge \theta \cdot \mathbb{E}\,\bar{\operatorname{tr}}\,\big[\Delta_X\, e^{\theta X}\big] \quad\text{when } \theta \le 0. \tag{3.9}
\end{align}

The conditional variance $\Delta_X$ is defined in (2.4).

Proof. Let us begin with the expression for the derivative of the trace mgf:

$$m'(\theta) = \mathbb{E}\,\bar{\operatorname{tr}}\left[\frac{d}{d\theta}\, e^{\theta X}\right] = \mathbb{E}\,\bar{\operatorname{tr}}\,\big[X e^{\theta X}\big]. \tag{3.10}$$

We can move the derivative inside the expectation because of the dominated convergence theorem
and the boundedness of $X$.

Now, we apply the method of exchangeable pairs, Lemma 2.3, with $F(X) = e^{\theta X}$ to identify an
alternative representation of the derivative (3.10):

$$m'(\theta) = \frac{1}{2\alpha}\, \mathbb{E}\,\bar{\operatorname{tr}}\,\big[(X - X')(e^{\theta X} - e^{\theta X'})\big]. \tag{3.11}$$

We have used the boundedness of $X$ to verify the regularity condition (2.2).

The expression (3.11) is perfectly suited for an application of the mean value trace inequality,
Lemma 3.4. First, assume that $\theta \ge 0$, and consider the function $h : s \mapsto e^{\theta s}$. The derivative
$h' : s \mapsto \theta e^{\theta s}$ is convex, so Lemma 3.4 implies that

$$\begin{aligned}
m'(\theta) &\le \frac{\theta}{4\alpha} \cdot \mathbb{E}\,\bar{\operatorname{tr}}\,\big[(X - X')^2 \cdot (e^{\theta X} + e^{\theta X'})\big] \\
&= \frac{\theta}{2\alpha} \cdot \mathbb{E}\,\bar{\operatorname{tr}}\,\big[(X - X')^2 \cdot e^{\theta X}\big] \\
&= \theta \cdot \mathbb{E}\,\bar{\operatorname{tr}}\left[\frac{1}{2\alpha}\, \mathbb{E}\big[(X - X')^2 \mid Z\big] \cdot e^{\theta X}\right].
\end{aligned}$$

The second line follows from the fact that $(X, X')$ is an exchangeable pair. In the last line, we have
used the boundedness of $X$ and $X'$ to invoke the pull-through property of conditional expectation.
Identify the conditional variance $\Delta_X$, defined in (2.4), to complete the argument.

The result for $\theta \le 0$ follows from an analogous argument. In this case, we simply observe that
the derivative of the function $h : s \mapsto e^{\theta s}$ is now concave, so the mean value trace inequality,
Lemma 3.4, produces a lower bound. The remaining steps are identical. □

Remark  3.8  (Regularity  Conditions).  To  simplify  the  presentation,  we  have  instated  a  boundedness 
assumption  in  Lemma  3.7.  All  the  examples  we  discuss  satisfy  this  requirement.  When  X  is 
unbounded,  Lemma  3.7  still  holds  provided  that  X  meets  an  integrability  condition. 

4.  Exponential  Concentration  for  Bounded  Random  Matrices 

We  are  now  prepared  to  establish  exponential  concentration  inequalities.  Our  first  major  result 
demonstrates  that  an  almost-sure  bound  for  the  conditional  variance  yields  exponential  tail  bounds 
for  the  extreme  eigenvalues  of  a  random  Hermitian  matrix.  We  can  also  obtain  estimates  for  the 
expectation  of  the  extreme  eigenvalues. 
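Theorem 4.1 (Concentration for Bounded Random Matrices). Consider a matrix Stein pair
$(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$. Suppose there exist nonnegative constants $c, v$ for which the conditional
variance (2.4) of the pair satisfies

$$\Delta_X \preccurlyeq cX + vI \quad\text{almost surely.} \tag{4.1}$$

Then, for all $t \ge 0$,

$$\begin{aligned}
\mathbb{P}\{\lambda_{\min}(X) \le -t\} &\le d \cdot \exp\left\{\frac{-t^2}{2v}\right\}, \\
\mathbb{P}\{\lambda_{\max}(X) \ge t\} &\le d \cdot \exp\left\{-\frac{t}{c} + \frac{v}{c^2} \log\left(1 + \frac{ct}{v}\right)\right\} \le d \cdot \exp\left\{\frac{-t^2}{2v + 2ct}\right\}.
\end{aligned}$$

Furthermore,

$$\begin{aligned}
\mathbb{E}\,\lambda_{\min}(X) &\ge -\sqrt{2v \log d}, \\
\mathbb{E}\,\lambda_{\max}(X) &\le \sqrt{2v \log d} + c \log d.
\end{aligned}$$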

This result may be viewed as a matrix analogue of Chatterjee's concentration inequality for scalar
random variables [Cha07, Thm. 1.5(ii)]. The proof of Theorem 4.1 appears below in Section 4.2.
Before we present the argument, let us explain how the result provides a short proof of a Hoeffding-type
inequality for matrices.

4.1. Application: Matrix Hoeffding Inequality. Theorem 4.1 yields an extension of Hoeffding's
inequality [Hoe63] that holds for an independent sum of bounded random matrices.

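Corollary 4.2 (Matrix Hoeffding). Consider a finite sequence $(Y_k)_{k \ge 1}$ of independent random
matrices in $\mathbb{H}^d$ and a finite sequence $(A_k)_{k \ge 1}$ of deterministic matrices in $\mathbb{H}^d$. Assume that

$$\mathbb{E} Y_k = 0 \quad\text{and}\quad Y_k^2 \preccurlyeq A_k^2 \quad\text{for each index } k.$$

Then, for all $t \ge 0$,

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum\nolimits_k Y_k\Big) \ge t\Big\} \le d \cdot e^{-t^2 / 2\sigma^2} \quad\text{where}\quad \sigma^2 := \frac{1}{2} \Big\| \sum\nolimits_k \big(A_k^2 + \mathbb{E} Y_k^2\big) \Big\|.$$

Furthermore,

$$\mathbb{E}\,\lambda_{\max}\Big(\sum\nolimits_k Y_k\Big) \le \sigma \sqrt{2 \log d}.$$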

Proof. Let $X = \sum_k Y_k$. Since $X$ is a sum of centered, independent random matrices, we can
use the matrix Stein pair constructed in Section 2.4. According to (2.6), the conditional variance
satisfies

$$\Delta_X = \frac{1}{2} \sum\nolimits_k \big(Y_k^2 + \mathbb{E} Y_k^2\big) \preccurlyeq \sigma^2 I$$

because $Y_k^2 \preccurlyeq A_k^2$. Invoke Theorem 4.1 with $c = 0$ and $v = \sigma^2$ to complete the bound. □

In the scalar setting $d = 1$, Corollary 4.2 reduces to an inequality of Chatterjee [Cha07,
Sec. 1.5], which itself represents an improvement over the classical scalar Hoeffding bound. In turn,
Corollary 4.2 improves upon the matrix Hoeffding inequality of [Tro11b, Thm. 1.3] in two ways.
First, we have reduced the constant in the exponent to its optimal value 1/2. Second, we have
decreased the size of the variance measure because $\sigma^2 \le \big\|\sum_k A_k^2\big\|$.
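A small exact computation shows Corollary 4.2 in action. The sketch below assumes a Rademacher series $Y_k = \varepsilon_k A_k$ with hypothetical symmetric coefficients $A_k$; for this model $Y_k^2 = A_k^2$, so the hypothesis holds with equality and the variance parameter reduces to $\sigma^2 = \|\sum_k A_k^2\|$. The seed, dimensions, and function names are choices made for this illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
n, d = 8, 3

# Hypothetical symmetric coefficients A_k; the summands are Y_k = eps_k * A_k.
coeffs = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    coeffs.append((G + G.T) / 2)

# sigma^2 = (1/2) || sum_k (A_k^2 + E Y_k^2) || = || sum_k A_k^2 || for Rademacher signs.
sigma2 = np.linalg.norm(sum(A @ A for A in coeffs), 2)

# Maximum eigenvalue of X for each of the 2^n equally likely sign patterns.
lam_max = np.array([
    np.linalg.eigvalsh(sum(e * A for e, A in zip(eps, coeffs)))[-1]
    for eps in itertools.product([-1, 1], repeat=n)
])

def hoeffding_bound(t):
    """Tail bound d * exp(-t^2 / (2 sigma^2)) from Corollary 4.2."""
    return d * np.exp(-t**2 / (2 * sigma2))

def exact_tail(t):
    """Exact probability that the maximum eigenvalue of X is at least t."""
    return np.mean(lam_max >= t)

mean_bound = np.sqrt(2 * sigma2 * np.log(d))   # sigma * sqrt(2 log d)
exact_mean = lam_max.mean()
```

Comparing `exact_tail(t)` with `hoeffding_bound(t)` over a range of `t` gives a feel for how conservative the bound is in this model, including the effect of the dimensional factor $d$ at small $t$.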
4.2. Proof of Theorem 4.1: Exponential Concentration. Suppose that $(X, X')$ is a matrix
Stein pair constructed from an auxiliary exchangeable pair $(Z, Z')$. Our aim is to bound the
normalized trace mgf

$$m(\theta) := \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\theta X} \quad\text{for } \theta \in \mathbb{R}. \tag{4.2}$$

The basic strategy is to develop a differential inequality, which we integrate to control $m(\theta)$ itself.
Once these estimates are in place, the matrix Laplace transform method, Proposition 3.3, furnishes
probability inequalities for the extreme eigenvalues of $X$.

The following result summarizes our bounds for the trace mgf $m(\theta)$.


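Lemma 4.3 (Trace Mgf Estimates for Bounded Random Matrices). Let $(X, X')$ be a matrix Stein
pair, and suppose there exist nonnegative constants $c, v$ for which

$$\Delta_X \preccurlyeq cX + vI \quad\text{almost surely.} \tag{4.3}$$

Then the normalized trace mgf $m(\theta) := \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\theta X}$ satisfies the bounds

\begin{align}
\log m(\theta) &\le \frac{v\theta^2}{2} && \text{when } \theta \le 0, \tag{4.4} \\
\log m(\theta) &\le \frac{v}{c^2} \left[\log\left(\frac{1}{1 - c\theta}\right) - c\theta\right] \tag{4.5} \\
&\le \frac{v\theta^2}{2(1 - c\theta)} && \text{when } 0 \le \theta < 1/c. \tag{4.6}
\end{align}

We establish Lemma 4.3 in Section 4.2.1 et seq. In Section 4.2.4, we finish the proof of Theorem 4.1
by combining these bounds with the matrix Laplace transform method.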


4.2.1. Boundedness of the Random Matrix. First, we confirm that the random matrix $X$ is almost
surely bounded under the hypothesis (4.3) on the conditional variance $\Delta_X$. Recall the definition
(2.4) of the conditional variance, and compute that

$$\Delta_X = \frac{1}{2\alpha}\, \mathbb{E}\big[(X - X')^2 \mid Z\big] \succcurlyeq \frac{1}{2\alpha} \big(\mathbb{E}[X - X' \mid Z]\big)^2 = \frac{1}{2\alpha} (\alpha X)^2 = \frac{\alpha}{2} X^2.$$

The semidefinite bound is the operator Jensen inequality (1.2), applied conditionally. The third
relation follows from the definition (2.1) of a matrix Stein pair. Owing to the assumption (4.3),
we reach the quadratic inequality $\frac{\alpha}{2} X^2 \preccurlyeq cX + vI$. The scale factor $\alpha$ is positive, so we may
conclude that the eigenvalues of $X$ are almost surely restricted to a bounded interval.


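4.2.2. Differential Inequalities for the Trace Mgf. The fact that $X$ is almost surely bounded ensures
that the derivative of the trace mgf has the form

$$m'(\theta) = \mathbb{E}\,\bar{\operatorname{tr}}\,\big[X e^{\theta X}\big] \quad\text{for } \theta \in \mathbb{R}. \tag{4.7}$$

To bound the derivative, we combine Lemma 3.7 with the assumed inequality (4.3) for the conditional
variance. For $\theta \ge 0$, we obtain

$$\begin{aligned}
m'(\theta) &\le \theta \cdot \mathbb{E}\,\bar{\operatorname{tr}}\,\big[\Delta_X\, e^{\theta X}\big] \\
&\le \theta \cdot \mathbb{E}\,\bar{\operatorname{tr}}\,\big[(cX + vI)\, e^{\theta X}\big] \\
&= c\theta \cdot \mathbb{E}\,\bar{\operatorname{tr}}\,\big[X e^{\theta X}\big] + v\theta \cdot \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\theta X} \\
&= c\theta \cdot m'(\theta) + v\theta \cdot m(\theta).
\end{aligned}$$

The second relation relies on the fact that the matrix $e^{\theta X}$ is positive. In the last line, we have
identified the trace mgf (4.2) and its derivative (4.7). For $\theta \le 0$, the same argument yields a lower
bound

$$m'(\theta) \ge c\theta \cdot m'(\theta) + v\theta \cdot m(\theta).$$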
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Rearrange  these  inequalities  to  isolate  the  log-derivative  m![9)/m(9 )  of  the  trace  mgf.  We  reach 

for  0  <  9  <  1/c,  and  (4.8) 

for  9  <  0.  (4.9) 


Je'ogm{e)  ~  JJJe 


4.2.3. Solving the Differential Inequalities. Observe that

$$\log m(0) = \log \bar{\operatorname{tr}}\, e^{0} = \log \bar{\operatorname{tr}}\, \mathbf{I} = \log 1 = 0. \tag{4.10}$$

Therefore, we may integrate the differential inequalities (4.8) and (4.9), starting at zero, to obtain bounds on $\log m(\theta)$ elsewhere.

First, assume that $0 \le \theta < 1/c$. In view of (4.10), the fundamental theorem of calculus and the differential inequality (4.8) imply that

$$\log m(\theta) = \int_0^\theta \frac{\mathrm{d}}{\mathrm{d}s} \log m(s)\, \mathrm{d}s \le \int_0^\theta \frac{vs}{1 - cs}\, \mathrm{d}s = -\frac{v}{c^2}\,\big(c\theta + \log(1 - c\theta)\big).$$

We can develop a weaker inequality by making a further approximation within the integral:

$$\log m(\theta) \le \int_0^\theta \frac{vs}{1 - cs}\, \mathrm{d}s \le \int_0^\theta \frac{vs}{1 - c\theta}\, \mathrm{d}s = \frac{v\theta^2}{2(1 - c\theta)}.$$

These inequalities are the trace mgf estimates (4.5) and (4.6) appearing in Lemma 4.3.
Next, assume that $\theta \le 0$. In this case, the differential inequality (4.9) yields

$$-\log m(\theta) = \int_\theta^0 \frac{\mathrm{d}}{\mathrm{d}s} \log m(s)\, \mathrm{d}s \ge \int_\theta^0 \frac{vs}{1 - cs}\, \mathrm{d}s \ge \int_\theta^0 vs\, \mathrm{d}s = -\frac{v\theta^2}{2}.$$

This calculation delivers the trace mgf bound (4.4), and the proof of Lemma 4.3 is complete.


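4.2.4. The Matrix Laplace Transform Argument. With Lemma 4.3 at hand, we quickly finish the proof of Theorem 4.1. First, let us establish probability inequalities for the maximum eigenvalue. The Laplace transform bound (3.1) and the trace mgf estimate (4.5) together yield

$$\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \inf_{0 < \theta < 1/c} \exp\left\{-\theta t - \frac{v}{c^2}\big(c\theta + \log(1 - c\theta)\big)\right\} \le d \cdot \exp\left\{-\frac{t}{c} + \frac{v}{c^2} \log\left(1 + \frac{ct}{v}\right)\right\}.$$

The second relation follows when we choose $\theta = t/(v + ct)$. Similarly, the trace mgf bound (4.6) delivers

$$\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \inf_{0 < \theta < 1/c} \exp\left\{-\theta t + \frac{v\theta^2}{2(1 - c\theta)}\right\} \le d \cdot \exp\left\{\frac{-t^2}{2v + 2ct}\right\},$$

where we have selected $\theta = t/(v + ct)$ again. To control the expectation of the maximum eigenvalue, we invoke the Laplace transform bound (3.3) and the trace mgf bound (4.6) to see that

$$\mathbb{E}\,\lambda_{\max}(X) \le \inf_{0 < \theta < 1/c} \frac{1}{\theta}\left[\log d + \frac{v\theta^2}{2(1 - c\theta)}\right] = \sqrt{2v \log d} + c \log d.$$

The second relation can be verified using a computer algebra system.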
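The closed form of this infimum can also be checked numerically. The sketch below is not part of the original argument: it evaluates the objective at the stationary point $1/\theta = c + \sqrt{v/(2\log d)}$ and compares the result with a brute-force grid search. The function names and the sample values of $v$, $c$, and $d$ are assumptions chosen only for illustration, and the check requires $d > 1$ and $c > 0$.

```python
import math


def objective(theta, v, c, d):
    """(1/theta) * [log d + v*theta^2 / (2*(1 - c*theta))] for 0 < theta < 1/c."""
    return (math.log(d) + v * theta**2 / (2.0 * (1.0 - c * theta))) / theta


def closed_form(v, c, d):
    """Claimed value of the infimum: sqrt(2 v log d) + c log d."""
    return math.sqrt(2.0 * v * math.log(d)) + c * math.log(d)


def minimizer(v, c, d):
    """Stationary point of the objective: 1/theta = c + sqrt(v / (2 log d))."""
    return 1.0 / (c + math.sqrt(v / (2.0 * math.log(d))))


def grid_minimum(v, c, d, steps=200000):
    """Brute-force minimum of the objective over an interior grid of (0, 1/c)."""
    upper = 1.0 / c
    return min(objective(upper * k / steps, v, c, d) for k in range(1, steps))
```

Calling `objective(minimizer(v, c, d), v, c, d)` should reproduce `closed_form(v, c, d)` up to rounding, and `grid_minimum` should never fall below it.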

Next, we turn to the results for the minimum eigenvalue. Combine the matrix Laplace transform bound (3.2) with the trace mgf bound (4.4) to reach

$$\mathbb{P}\{\lambda_{\min}(X) \le -t\} \le d \cdot \inf_{\theta < 0} \exp\left\{\theta t + \frac{v\theta^2}{2}\right\} = d \cdot e^{-t^2/(2v)}.$$

The infimum is attained at $\theta = -t/v$. To compute the expectation of the minimum eigenvalue, we apply the Laplace transform bound (3.4) and the trace mgf bound (4.4), whence

$$\mathbb{E}\,\lambda_{\min}(X) \ge \sup_{\theta < 0} \frac{1}{\theta}\left[\log d + \frac{v\theta^2}{2}\right] = -\sqrt{2v \log d}.$$

The supremum is attained at $\theta = -\sqrt{2v^{-1} \log d}$.


5.  Refined  Exponential  Concentration  for  Bounded  Random  Matrices 

Although Theorem 4.1 is a strong result, the hypothesis $\Delta_X \preccurlyeq cX + v\mathbf{I}$ on the conditional variance is too stringent for many situations of interest. Our second major result shows that we can use the typical behavior of the conditional variance to obtain tail bounds for the maximum eigenvalue of a random Hermitian matrix.

Theorem 5.1 (Refined Concentration for Random Matrices). Let $(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$ be a matrix Stein pair, and assume that $X$ is almost surely bounded in norm. Define the function

$$r(\psi) := \frac{1}{\psi} \log \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\psi \Delta_X} \quad \text{for each } \psi > 0, \tag{5.1}$$

where $\Delta_X$ is the conditional variance (2.4). Then, for all $t \ge 0$ and all $\psi > 0$,

$$\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \exp\left\{\frac{-t^2}{2r(\psi) + 2t/\sqrt{\psi}}\right\}. \tag{5.2}$$

Furthermore, for all $\psi > 0$,

$$\mathbb{E}\,\lambda_{\max}(X) \le \sqrt{2r(\psi) \log d} + \frac{\log d}{\sqrt{\psi}}. \tag{5.3}$$

This theorem is essentially a matrix version of a result from Chatterjee's thesis [Cha08, Thm. 3.13]. The proof of Theorem 5.1 is similar in spirit to the proof of Theorem 4.1, so we postpone the demonstration until Appendix A.

Let us offer some remarks to clarify the meaning of this result. Recall that $\Delta_X$ is a stochastic approximation for the variance of the random matrix $X$. We can interpret the function $r(\psi)$ as a measure of the typical magnitude of the conditional variance. Indeed, the matrix Laplace transform result, Proposition 3.3, ensures that

$$\mathbb{E}\,\lambda_{\max}(\Delta_X) \le \inf_{\psi > 0} \left[ r(\psi) + \frac{\log d}{\psi} \right].$$

The moral of this inequality is that we can often identify a value of $\psi$ to make $r(\psi) \approx \mathbb{E}\,\lambda_{\max}(\Delta_X)$. Ideally, we also want to choose $r(\psi) \gg \psi^{-1/2}$ so that the term $r(\psi)$ drives the tail bound (5.2) when the parameter $t$ is small. In the next subsection, we show that these heuristics yield a matrix Bernstein inequality.

5.1.  Application:  The  Matrix  Bernstein  Inequality.  As  an  illustration  of  Theorem  5.1,  we 
establish  a  tail  bound  for  a  sum  of  centered,  independent  random  matrices  that  are  subject  to  a 
uniform  norm  bound. 

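Corollary 5.2 (Matrix Bernstein). Consider an independent sequence $(Y_k)_{k \ge 1}$ of random matrices in $\mathbb{H}^d$ that satisfy

$$\mathbb{E}\,Y_k = 0 \quad \text{and} \quad \|Y_k\| \le R \quad \text{for each index } k.$$

Then, for all $t \ge 0$,

$$\mathbb{P}\Big\{\lambda_{\max}\Big(\sum\nolimits_k Y_k\Big) \ge t\Big\} \le d \cdot \exp\left\{\frac{-t^2}{3\sigma^2 + 2Rt}\right\} \quad \text{where} \quad \sigma^2 := \Big\|\sum\nolimits_k \mathbb{E}\,Y_k^2\Big\|.$$

Furthermore,

$$\mathbb{E}\,\lambda_{\max}\Big(\sum\nolimits_k Y_k\Big) \le \sigma\sqrt{3 \log d} + R \log d.$$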

Corollary  5.2  is  directly  comparable  with  other  matrix  Bernstein  inequalities  in  the  literature. 
The constants here are slightly worse than [Tro11b, Thm. 1.4] and slightly better than [Oli09,
Thm.  1.2].  The  hypotheses  in  the  current  result  are  somewhat  stricter  than  those  in  the  prior 
works.  Nevertheless,  the  proof  provides  a  template  for  studying  more  complicated  random  matrices 
that  involve  dependent  random  variables. 


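Proof. We consider the matrix Stein pair $(X, X')$ described in Section 2.4. The calculation (2.6) shows that the conditional variance of $X$ satisfies

$$\Delta_X = \frac{1}{2} \sum\nolimits_k \big(Y_k^2 + \mathbb{E}\,Y_k^2\big).$$

The function $r(\psi)$ measures the typical size of $\Delta_X$. To control $r(\psi)$, we center the conditional variance and reduce the expression as follows.

$$
\begin{aligned}
r(\psi) := \frac{1}{\psi} \log \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\psi \Delta_X}
&\le \frac{1}{\psi} \log \mathbb{E}\,\bar{\operatorname{tr}} \exp\big\{\psi(\Delta_X - \mathbb{E}\,\Delta_X) + \psi\,\|\mathbb{E}\,\Delta_X\| \cdot \mathbf{I}\big\} \\
&= \frac{1}{\psi} \log \mathbb{E}\,\bar{\operatorname{tr}} \big[e^{\psi\sigma^2} \cdot \exp\{\psi(\Delta_X - \mathbb{E}\,\Delta_X)\}\big] \\
&= \sigma^2 + \frac{1}{\psi} \log \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\psi(\Delta_X - \mathbb{E}\,\Delta_X)}.
\end{aligned}
\tag{5.4}
$$

The inequality depends on the monotonicity of the trace exponential [Pet94, Sec. 2]. Afterward, we have applied the identity $\|\mathbb{E}\,\Delta_X\| = \|\mathbb{E}\,X^2\| = \sigma^2$, which follows from (2.5) and the independence of the sequence $(Y_k)_{k \ge 1}$.

Introduce the centered random matrix

$$W := \Delta_X - \mathbb{E}\,\Delta_X = \frac{1}{2} \sum\nolimits_k \big(Y_k^2 - \mathbb{E}\,Y_k^2\big). \tag{5.5}$$

Observe that $W$ consists of a sum of centered, independent random matrices, so we can study it using the matrix Stein pair discussed in Section 2.4. Adapt the conditional variance calculation (2.6) to obtain

$$
\begin{aligned}
\Delta_W &= \frac{1}{2} \cdot \frac{1}{4} \sum\nolimits_k \Big[\big(Y_k^2 - \mathbb{E}\,Y_k^2\big)^2 + \mathbb{E}\big(Y_k^2 - \mathbb{E}\,Y_k^2\big)^2\Big] \\
&\preccurlyeq \frac{1}{8} \sum\nolimits_k \Big[2Y_k^4 + 2\big(\mathbb{E}\,Y_k^2\big)^2 + \mathbb{E}\,Y_k^4 - \big(\mathbb{E}\,Y_k^2\big)^2\Big] \\
&\preccurlyeq \frac{1}{4} \sum\nolimits_k \big(Y_k^4 + \mathbb{E}\,Y_k^4\big).
\end{aligned}
$$

To reach the second line, we apply the operator convexity (1.1) of the matrix square to the first parenthesis, and we compute the second expectation explicitly. The third line follows from the operator Jensen inequality (1.2). To continue, make the estimate $Y_k^4 \preccurlyeq R^2\, Y_k^2$ in both terms. Thus,

$$\Delta_W \preccurlyeq \frac{R^2}{4} \sum\nolimits_k \big(Y_k^2 + \mathbb{E}\,Y_k^2\big) \preccurlyeq \frac{R^2}{2} \cdot W + \frac{R^2\sigma^2}{2} \cdot \mathbf{I}.$$

The trace mgf bound, Lemma 4.3, delivers

$$\log m_W(\psi) = \log \mathbb{E}\,\bar{\operatorname{tr}}\, e^{\psi W} \le \frac{R^2\sigma^2\psi^2}{4 - 2R^2\psi}. \tag{5.6}$$

To complete the proof, combine the bounds (5.4) and (5.6) to reach

$$r(\psi) \le \sigma^2 + \frac{R^2\sigma^2\psi}{4 - 2R^2\psi}.$$

In particular, it holds that $r(R^{-2}) \le 1.5\,\sigma^2$. The result now follows from Theorem 5.1. □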
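To make the final step concrete, the display below sketches the substitution $\psi = R^{-2}$ into the bound on $r(\psi)$ and into the two conclusions (5.2) and (5.3) of Theorem 5.1. It is an illustrative expansion of the last sentence of the proof and uses only quantities already defined above.

```latex
\begin{aligned}
r(R^{-2}) &\le \sigma^2 + \frac{R^2\sigma^2 \cdot R^{-2}}{4 - 2R^2 \cdot R^{-2}}
           = \sigma^2 + \frac{\sigma^2}{2} = \tfrac{3}{2}\,\sigma^2, \\
2\,r(R^{-2}) + \frac{2t}{\sqrt{R^{-2}}} &\le 3\sigma^2 + 2Rt, \\
\sqrt{2\,r(R^{-2})\log d} + \frac{\log d}{\sqrt{R^{-2}}} &\le \sigma\sqrt{3\log d} + R\log d.
\end{aligned}
```

The second line gives the denominator in the tail bound of Corollary 5.2, and the third line gives the bound on the expected maximum eigenvalue.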

6.  Polynomial  Moments  and  the  Spectral  Norm  of  a  Random  Matrix 

We  can  also  study  the  spectral  norm  of  a  random  matrix  by  bounding  its  polynomial  moments. 
To  progress  toward  these  results,  let  us  introduce  the  family  of  Schatten  norms. 

Definition 6.1 (Schatten Norm). For each $p \ge 1$, the Schatten $p$-norm is defined as

$$\|B\|_p := \big(\operatorname{tr} |B|^p\big)^{1/p} \quad \text{for } B \in \mathbb{M}^d.$$

In this setting, $|B| := (B^*B)^{1/2}$. Bhatia's book [Bha97, Ch. IV] contains a detailed discussion of these norms and their properties.

The following proposition is a matrix analog of the Chebyshev bound from classical probability.

Proposition 6.2 (Matrix Chebyshev Method). Let $X$ be a random matrix. For all $t > 0$,

$$\mathbb{P}\{\|X\| \ge t\} \le \inf_{p \ge 1}\, t^{-p} \cdot \mathbb{E}\,\|X\|_p^p. \tag{6.1}$$

Furthermore,

$$\mathbb{E}\,\|X\| \le \inf_{p \ge 1}\, \big(\mathbb{E}\,\|X\|_p^p\big)^{1/p}. \tag{6.2}$$

Proof. To prove (6.1), we use Markov's inequality. For $p \ge 1$,

$$\mathbb{P}\{\|X\| \ge t\} \le t^{-p} \cdot \mathbb{E}\,\|X\|^p = t^{-p} \cdot \mathbb{E}\,\big\|\,|X|^p\,\big\| \le t^{-p} \cdot \mathbb{E}\operatorname{tr} |X|^p,$$

since the trace of a positive matrix dominates the maximum eigenvalue. To verify (6.2), select $p \ge 1$. Jensen's inequality implies that

$$\mathbb{E}\,\|X\| \le \big(\mathbb{E}\,\|X\|^p\big)^{1/p} = \big(\mathbb{E}\,\big\|\,|X|^p\,\big\|\big)^{1/p} \le \big(\mathbb{E}\operatorname{tr} |X|^p\big)^{1/p}.$$

Identify the Schatten $p$-norm and take infima to complete the bounds. □

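Remark 6.3 (Chebyshev vs. Laplace). The matrix Chebyshev bound (6.1) is at least as tight as the analogous matrix Laplace transform bound (3.1) from Proposition 3.3. Indeed, suppose that $X$ is a bounded random matrix. For all $\theta, t > 0$, we have

$$
\begin{aligned}
e^{-\theta t} \cdot \mathbb{E}\operatorname{tr} e^{\theta |X|}
&= e^{-\theta t} \sum_{q=0}^{\infty} \frac{\theta^q}{q!}\, \mathbb{E}\operatorname{tr} |X|^q \\
&\ge e^{-\theta t} \sum_{q=0}^{\infty} \frac{(\theta t)^q}{q!} \Big[ 1 \wedge \inf_{p \ge 1} \big(t^{-p} \cdot \mathbb{E}\operatorname{tr} |X|^p\big) \Big]
= 1 \wedge \inf_{p \ge 1} \big(t^{-p} \cdot \mathbb{E}\,\|X\|_p^p\big).
\end{aligned}
$$

The first identity follows from the Taylor expansion of the matrix exponential. A similar argument allows us to convert polynomial moment bounds into bounds on the trace mgf.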

7.  Polynomial  Moment  Inequalities  for  Random  Matrices 

Our  last  major  result  demonstrates  that  the  polynomial  moments  of  a  random  Hermitian  matrix 
are  controlled  by  the  moments  of  the  conditional  variance.  By  combining  this  result  with  the  matrix 
Chebyshev  method,  Proposition  6.2,  we  can  obtain  probability  inequalities  for  the  spectral  norm 
of  a  random  Hermitian  matrix. 

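Theorem 7.1 (Matrix BDG Inequality). Let $p = 1$ or $p \ge 1.5$. Suppose that $(X, X')$ is a matrix Stein pair where $\mathbb{E}\,\|X\|_{2p}^{2p} < \infty$. Then

$$\big(\mathbb{E}\,\|X\|_{2p}^{2p}\big)^{1/(2p)} \le \sqrt{2p - 1} \cdot \big(\mathbb{E}\,\|\Delta_X\|_p^p\big)^{1/(2p)}.$$

The conditional variance $\Delta_X$ is defined in (2.4).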


This theorem extends a result of Chatterjee [Cha07, Thm. 1.5(iii)] to the matrix setting. Chatterjee's bound can be viewed as an exchangeable pairs version of the Burkholder-Davis-Gundy (BDG) inequality from classical martingale theory [Bur73]. Matrix extensions of the BDG inequality appear in the work of Pisier-Xu [PX97] and the work of Junge-Xu [JX03, JX08].

The  proof  of  Theorem  7.1  appears  below  in  Section  7.3.  Before  we  present  the  argument,  let  us 
offer  a  few  tangential  remarks  and  describe  a  few  striking  applications  of  this  inequality. 

Remark 7.2 (Missing Values). The matrix BDG inequality also holds when $1 < p < 1.5$. In this range, our best bound for the constant is $\sqrt{4p - 2}$. The proof requires a variant of the mean value trace inequality for a convex function $h$.

Remark  7.3  (Infinite-Dimensional  Versions).  Theorem  7.1  and  its  corollaries  extend  to  Schatten- 
class  operators. 

7.1. Application: Matrix Khintchine Inequality. First, we demonstrate that the matrix BDG inequality contains an improvement of the noncommutative Khintchine inequality [LP86, LPP91] in the matrix setting. This result has been a dominant tool in several application areas over the last few years, largely because of the articles [Rud99, RV07].

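Corollary 7.4 (Matrix Khintchine). Suppose that $p = 1$ or $p \ge 1.5$. Consider a finite sequence $(Y_k)_{k \ge 1}$ of independent, random, Hermitian matrices and a deterministic sequence $(A_k)_{k \ge 1}$ for which

$$\mathbb{E}\,Y_k = 0 \quad \text{and} \quad Y_k^2 \preccurlyeq A_k^2 \quad \text{for each index } k. \tag{7.1}$$

Then

$$\Big(\mathbb{E}\,\Big\|\sum\nolimits_k Y_k\Big\|_{2p}^{2p}\Big)^{1/(2p)} \le \sqrt{p - \tfrac{1}{2}} \cdot \Big\|\Big(\sum\nolimits_k \big(A_k^2 + \mathbb{E}\,Y_k^2\big)\Big)^{1/2}\Big\|_{2p}.$$

In particular, when $(\varepsilon_k)_{k \ge 1}$ is an independent sequence of Rademacher random variables,

$$\Big(\mathbb{E}\,\Big\|\sum\nolimits_k \varepsilon_k A_k\Big\|_{2p}^{2p}\Big)^{1/(2p)} \le \sqrt{2p - 1} \cdot \Big\|\Big(\sum\nolimits_k A_k^2\Big)^{1/2}\Big\|_{2p}. \tag{7.2}$$

Proof. Consider the random matrix $X = \sum_k Y_k$. We use the matrix Stein pair constructed in Section 2.4. According to (2.6), the conditional variance $\Delta_X$ satisfies

$$\Delta_X = \frac{1}{2} \sum\nolimits_k \big(Y_k^2 + \mathbb{E}\,Y_k^2\big) \preccurlyeq \frac{1}{2} \sum\nolimits_k \big(A_k^2 + \mathbb{E}\,Y_k^2\big).$$

An application of Theorem 7.1 completes the argument. □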

Buchholz [Buc01, Thm. 5] has demonstrated that the optimal constant $C_{2p}$ in (7.2) satisfies

$$C_{2p}^{2p} = (2p - 1)!! = \frac{(2p)!}{2^p\, p!} \quad \text{for } p = 1, 2, 3, \dots.$$

It can be verified using basic analysis that

$$\frac{(2p - 1)^p}{(2p - 1)!!} < e^{p - 1/2}.$$

As a consequence, the constant in (7.2) lies within a factor $\sqrt{e}$ of optimal. Previous methods for establishing the matrix Khintchine inequality are rather involved, so it is remarkable that the simple argument based on exchangeable pairs leads to a result that is so accurate.
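The comparison with the optimal constant is easy to spot-check for integer $p$. The following sketch is an illustration, not a proof: it works in logarithms to avoid overflow and computes the ratio $\sqrt{2p-1}/C_{2p}$ with $C_{2p} = ((2p-1)!!)^{1/(2p)}$. The function names are our own.

```python
import math


def log_double_factorial_odd(p):
    """log((2p-1)!!) computed as the sum of log(2k-1) for k = 1, ..., p."""
    return sum(math.log(2 * k - 1) for k in range(1, p + 1))


def log_ratio(p):
    """Logarithm of (2p-1)^p / (2p-1)!!."""
    return p * math.log(2 * p - 1) - log_double_factorial_odd(p)


def constant_gap(p):
    """Ratio sqrt(2p-1) / C_{2p}, where C_{2p} = ((2p-1)!!)^(1/(2p))."""
    return math.exp(log_ratio(p) / (2 * p))
```

For every integer $p$ in a tested range, `log_ratio(p)` should stay below $p - 1/2$, and `constant_gap(p)` should stay below $\sqrt{e}$.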
7.2. Application: Matrix Rosenthal Inequality. As a second example, we can develop a more sophisticated set of moment inequalities that are roughly the polynomial equivalent of the exponential moment bound underlying the matrix Bernstein inequality.


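Corollary 7.5 (Matrix Rosenthal Inequality). Suppose that $p = 1$ or $p \ge 1.5$. Consider a finite sequence $(P_k)_{k \ge 1}$ of independent, random psd matrices that satisfy $\mathbb{E}\,\|P_k\|_{2p}^{2p} < \infty$. Then

$$\Big(\mathbb{E}\,\Big\|\sum\nolimits_k P_k\Big\|_{2p}^{2p}\Big)^{1/(2p)} \le \bigg[\Big\|\sum\nolimits_k \mathbb{E}\,P_k\Big\|_{2p}^{1/2} + \sqrt{4p - 2} \cdot \Big(\sum\nolimits_k \mathbb{E}\,\|P_k\|_{2p}^{2p}\Big)^{1/(4p)}\bigg]^2. \tag{7.3}$$

Now, consider a finite sequence $(Y_k)_{k \ge 1}$ of centered, independent, random Hermitian matrices, and assume that $\mathbb{E}\,\|Y_k\|_{4p}^{4p} < \infty$. Then

$$\Big(\mathbb{E}\,\Big\|\sum\nolimits_k Y_k\Big\|_{4p}^{4p}\Big)^{1/(4p)} \le \sqrt{4p - 1} \cdot \Big\|\Big(\sum\nolimits_k \mathbb{E}\,Y_k^2\Big)^{1/2}\Big\|_{4p} + (4p - 1) \cdot \Big(\sum\nolimits_k \mathbb{E}\,\|Y_k\|_{4p}^{4p}\Big)^{1/(4p)}. \tag{7.4}$$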


Turn to Appendix B for our proof of Corollary 7.5. This result extends a moment inequality due to Nagaev and Pinelis [NP77], which refines the constants in Rosenthal's inequality [Ros70, Lem. 1]. See the historical discussion [Pin94, Sec. 5] for details. As we were finishing this paper, we learned that Junge and Zheng have recently established a noncommutative moment inequality [JZ11, Thm. 0.4] that is quite similar to Corollary 7.5.


7.3.  Proof  of  the  Matrix  BDG  Inequality.  In  many  respects,  the  proof  of  the  matrix  BDG 
inequality  is  similar  to  the  proof  of  the  exponential  concentration  result,  Theorem  4.1.  Both  are 
based  on  moment  comparison  arguments  that  ultimately  depend  on  the  method  of  exchangeable 
pairs  and  the  mean  value  trace  inequality. 

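Suppose that $(X, X')$ is a matrix Stein pair with scale factor $\alpha$. First, observe that the result for $p = 1$ already follows from (2.5). Therefore, we may assume that $p \ge 1.5$. Let us introduce notation for the quantity of interest:

$$E := \mathbb{E}\,\|X\|_{2p}^{2p} = \mathbb{E}\operatorname{tr} |X|^{2p}.$$

Rewrite the expression for $E$ by peeling off a copy of $|X|$. This move yields

$$E = \mathbb{E}\operatorname{tr}\big[|X| \cdot |X|^{2p-1}\big] = \mathbb{E}\operatorname{tr}\big[X \cdot \operatorname{sgn}(X) \cdot |X|^{2p-1}\big].$$

Apply the method of exchangeable pairs, Lemma 2.3, with $F(X) = \operatorname{sgn}(X) \cdot |X|^{2p-1}$ to reach

$$E = \frac{1}{2\alpha}\, \mathbb{E}\operatorname{tr}\big[(X - X') \cdot \big(\operatorname{sgn}(X) \cdot |X|^{2p-1} - \operatorname{sgn}(X') \cdot |X'|^{2p-1}\big)\big].$$

To verify the regularity condition (2.2) we need for Lemma 2.3, compute that

$$
\begin{aligned}
\mathbb{E}\,\big\|(X - X') \cdot \operatorname{sgn}(X) \cdot |X|^{2p-1}\big\|
&\le \mathbb{E}\big(\|X\|\,\|X\|^{2p-1}\big) + \mathbb{E}\big(\|X'\|\,\|X\|^{2p-1}\big) \\
&\le 2\,\big(\mathbb{E}\,\|X\|_{2p}^{2p}\big)^{1/(2p)} \big(\mathbb{E}\,\|X\|_{2p}^{2p}\big)^{(2p-1)/(2p)} = 2\,\mathbb{E}\,\|X\|_{2p}^{2p} < \infty.
\end{aligned}
$$

We have used the fact that $\operatorname{sgn}(X)$ is a unitary matrix, the exchangeability of $(X, X')$, Hölder's inequality for expectation, and the fact that the Schatten $2p$-norm dominates the spectral norm.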

We intend to apply the mean value trace inequality to obtain an estimate for the quantity $E$. Consider the function $h : s \mapsto \operatorname{sgn}(s) \cdot |s|^{2p-1}$. Its derivative, $h'(s) = (2p - 1) \cdot |s|^{2p-2}$, is convex because $p \ge 1.5$. Lemma 3.4 delivers the bound

$$
\begin{aligned}
E &\le \frac{2p - 1}{4\alpha}\, \mathbb{E}\operatorname{tr}\big[(X - X')^2 \cdot \big(|X|^{2p-2} + |X'|^{2p-2}\big)\big] \\
&= \frac{2p - 1}{2\alpha}\, \mathbb{E}\operatorname{tr}\big[(X - X')^2 \cdot |X|^{2p-2}\big] \\
&= (2p - 1) \cdot \mathbb{E}\operatorname{tr}\big[\Delta_X \cdot |X|^{2p-2}\big].
\end{aligned}
$$

The second line follows from the exchangeability of $X$ and $X'$. In the last line, we identify the conditional variance $\Delta_X$, defined in (2.4). As before, the moment bound $\mathbb{E}\,\|X\|_{2p}^{2p} < \infty$ is strong enough to justify using the pull-through property in this step.

To  continue,  we  must  find  a  copy  of  E  within  the  latter  expression.  We  can  accomplish  this  goal 
using one of the basic results from the theory of Schatten norms [Bha97, Cor. IV.2.6].

Proposition 7.6 (Hölder Inequality for Trace). Let $p$ and $q$ be Hölder conjugate indices, i.e., positive numbers with the relationship $q = p/(p - 1)$. Then

$$\operatorname{tr}(BC) \le \|B\|_p\, \|C\|_q \quad \text{for all } B, C \in \mathbb{M}^d.$$

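To complete the argument, apply the Hölder inequality for the trace followed by the Hölder inequality for the expectation. Thus,

$$
\begin{aligned}
E &\le (2p - 1) \cdot \mathbb{E}\Big[\|\Delta_X\|_p \cdot \big\|\,|X|^{2p-2}\,\big\|_{p/(p-1)}\Big] \\
&= (2p - 1) \cdot \mathbb{E}\Big[\|\Delta_X\|_p \cdot \|X\|_{2p}^{2p-2}\Big] \\
&\le (2p - 1) \cdot \big(\mathbb{E}\,\|\Delta_X\|_p^p\big)^{1/p} \cdot \big(\mathbb{E}\,\|X\|_{2p}^{2p}\big)^{(p-1)/p} \\
&= (2p - 1) \cdot \big(\mathbb{E}\,\|\Delta_X\|_p^p\big)^{1/p} \cdot E^{(p-1)/p}.
\end{aligned}
$$

Solve this algebraic inequality for the positive number $E$ to conclude that

$$E \le (2p - 1)^p \cdot \mathbb{E}\,\|\Delta_X\|_p^p.$$

Extract the $(2p)$th root to establish the matrix BDG inequality.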
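For readers who want the algebra spelled out, the display below sketches the two closing steps. It is an illustrative expansion and assumes $E > 0$, since the inequality is trivial when $E = 0$.

```latex
\begin{aligned}
E^{1/p} = E \cdot E^{-(p-1)/p}
  &\le (2p-1)\cdot\big(\mathbb{E}\,\|\Delta_X\|_p^p\big)^{1/p}
  &&\Longrightarrow\quad
  E \le (2p-1)^p \cdot \mathbb{E}\,\|\Delta_X\|_p^p, \\
E^{1/(2p)} = \big(\mathbb{E}\,\|X\|_{2p}^{2p}\big)^{1/(2p)}
  &\le \sqrt{2p-1}\cdot\big(\mathbb{E}\,\|\Delta_X\|_p^p\big)^{1/(2p)}.
\end{aligned}
```

The first line divides by $E^{(p-1)/p}$ and raises both sides to the power $p$. The second line takes the $(2p)$th root and recovers the statement of Theorem 7.1.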

8.  Extension  to  General  Complex  Matrices 

Although it may seem that our theory is limited to random Hermitian matrices, results for general random matrices follow as a formal corollary [Rec11, Tro11b]. The approach is based on a device from operator theory [Pau02].

Definition 8.1 (Hermitian Dilation). Let $B$ be a matrix in $\mathbb{C}^{d_1 \times d_2}$, and set $d = d_1 + d_2$. The Hermitian dilation of $B$ is the matrix

$$\mathscr{D}(B) := \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix} \in \mathbb{H}^d. \tag{8.1}$$

The dilation has two valuable properties. First, it preserves spectral information:

$$\lambda_{\max}(\mathscr{D}(B)) = \|\mathscr{D}(B)\| = \|B\|. \tag{8.2}$$

Second, the square of the dilation satisfies

$$\mathscr{D}(B)^2 = \begin{bmatrix} BB^* & 0 \\ 0 & B^*B \end{bmatrix}. \tag{8.3}$$

We can study a random matrix (not necessarily Hermitian) by applying our matrix concentration inequalities to the Hermitian dilation of the random matrix. As an illustration, let us prove a Bernstein inequality for general random matrices.
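The block structure in (8.1) and (8.3) can be checked on a small example. The sketch below is a minimal illustration in plain Python, with matrices stored as lists of rows. The helper names `conj_transpose`, `matmul`, and `dilation` are our own and do not come from the paper.

```python
def conj_transpose(B):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[B[i][j].conjugate() for i in range(len(B))]
            for j in range(len(B[0]))]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def dilation(B):
    """Hermitian dilation [[0, B], [B*, 0]] of a d1 x d2 matrix B."""
    d1, d2 = len(B), len(B[0])
    Bs = conj_transpose(B)
    top = [[0j] * d1 + list(B[i]) for i in range(d1)]        # rows [0, B]
    bottom = [list(Bs[j]) + [0j] * d2 for j in range(d2)]    # rows [B*, 0]
    return top + bottom
```

Squaring `dilation(B)` with `matmul` should produce the block-diagonal matrix with blocks $BB^*$ and $B^*B$, as in (8.3).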

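Corollary 8.2 (Bernstein Inequality for General Matrices). Consider a finite sequence $(Z_k)_{k \ge 1}$ of independent random matrices in $\mathbb{C}^{d_1 \times d_2}$ that satisfy

$$\mathbb{E}\,Z_k = 0 \quad \text{and} \quad \|Z_k\| \le R \quad \text{for each index } k.$$

Define $d := d_1 + d_2$, and introduce the variance measure

$$\sigma^2 := \max\Big\{\Big\|\sum\nolimits_k \mathbb{E}(Z_k Z_k^*)\Big\|,\ \Big\|\sum\nolimits_k \mathbb{E}(Z_k^* Z_k)\Big\|\Big\}.$$

Then, for all $t \ge 0$,

$$\mathbb{P}\Big\{\Big\|\sum\nolimits_k Z_k\Big\| \ge t\Big\} \le d \cdot \exp\left\{\frac{-t^2}{3\sigma^2 + 2Rt}\right\}.$$

Furthermore,

$$\mathbb{E}\,\Big\|\sum\nolimits_k Z_k\Big\| \le \sigma\sqrt{3 \log d} + R \log d.$$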


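Proof. Consider the random series $\sum_k \mathscr{D}(Z_k)$. The summands are independent, random Hermitian matrices that satisfy

$$\mathbb{E}\,\mathscr{D}(Z_k) = 0 \quad \text{and} \quad \|\mathscr{D}(Z_k)\| \le R.$$

The second identity depends on the spectral property (8.2). Therefore, the matrix Bernstein inequality, Corollary 5.2, applies. To state the outcome, we first note that $\lambda_{\max}\big(\sum_k \mathscr{D}(Z_k)\big) = \big\|\sum_k Z_k\big\|$, again because of the spectral property (8.2). Next, use the formula (8.3) to compute that

$$\Big\|\sum\nolimits_k \mathbb{E}\big[\mathscr{D}(Z_k)^2\big]\Big\| = \left\| \begin{bmatrix} \sum_k \mathbb{E}(Z_k Z_k^*) & 0 \\ 0 & \sum_k \mathbb{E}(Z_k^* Z_k) \end{bmatrix} \right\| = \sigma^2.$$

This observation completes the proof. □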


9.  The  Sum  of  Conditionally  Zero-Mean  Matrices 

A  chief  advantage  of  the  method  of  exchangeable  pairs  is  its  ability  to  handle  random  matrices 
constructed  from  dependent  random  variables.  In  this  section,  we  briefly  describe  a  way  to  relax 
the  independence  requirement  when  studying  a  sum  of  random  matrices.  In  Sections  10  and  11, 
we  develop  more  elaborate  examples. 

9.1. Formulation. Consider a finite sequence $(Y_1, \dots, Y_n)$ of random Hermitian matrices. We say that the sequence has the conditional zero-mean property when

$$\mathbb{E}[Y_k \mid (Y_j)_{j \ne k}] = 0 \quad \text{almost surely for each index } k. \tag{9.1}$$

This definition is related to the conditional expectation property of a martingale difference sequence, although it is more restrictive. Suppose that we are interested in a sum of conditionally zero-mean random matrices:

$$X := Y_1 + \dots + Y_n. \tag{9.2}$$

This type of series is quite common because it includes the case of a Rademacher series with random matrix coefficients.

Example 9.1 (Rademacher Series with Random Matrix Coefficients). Consider a finite sequence $(W_k)_{k \ge 1}$ of random Hermitian matrices. Suppose that the sequence $(\varepsilon_k)_{k \ge 1}$ consists of independent Rademacher random variables that are independent from the random matrices. Consider the random series

$$\sum\nolimits_k \varepsilon_k W_k.$$

The summands may be strongly dependent on each other, but the independence of the Rademacher variables ensures that the summands satisfy the conditional zero-mean property (9.1).

9.2. A Matrix Stein Pair. Let us describe how to build a matrix Stein pair $(X, X')$ for the sum (9.2) of conditionally zero-mean random matrices. The approach is similar to the case of an independent sum, which appears in Section 2.4. For each $k$, we draw a random matrix $Y_k'$ so that $Y_k'$ and $Y_k$ are conditionally i.i.d. given $(Y_j)_{j \ne k}$. Then, independently, we draw an index $K$ uniformly at random from $\{1, \dots, n\}$. As in Section 2.4, the random matrix

$$X' := Y_1 + \dots + Y_{K-1} + Y_K' + Y_{K+1} + \dots + Y_n$$

is an exchangeable counterpart to $X$. The conditional zero-mean property (9.1) ensures that

$$
\begin{aligned}
\mathbb{E}[X - X' \mid (Y_j)_{j \ge 1}] &= \mathbb{E}[Y_K - Y_K' \mid (Y_j)_{j \ge 1}] \\
&= \frac{1}{n} \sum_{k=1}^n \big(Y_k - \mathbb{E}[Y_k' \mid (Y_j)_{j \ne k}]\big) = \frac{1}{n} \sum_{k=1}^n Y_k = \frac{1}{n}\, X.
\end{aligned}
$$

Therefore, $(X, X')$ is a matrix Stein pair with scale factor $\alpha = n^{-1}$.

We can determine the conditional variance after a short argument that parallels the computation (2.6) in the independent setting:

$$\Delta_X = \frac{n}{2} \cdot \mathbb{E}\big[(Y_K - Y_K')^2 \mid (Y_j)_{j \ge 1}\big] = \frac{1}{2} \sum_{k=1}^n \big(Y_k^2 + \mathbb{E}[Y_k^2 \mid (Y_j)_{j \ne k}]\big). \tag{9.3}$$

The expression (9.3) shows that, even in the presence of some dependence, we can control the size of the conditional expectation uniformly if we control the size of the individual summands.

Using the Stein pair $(X, X')$ and the expression (9.3), we may develop a variety of concentration inequalities for conditionally zero-mean sums that are analogous to our results for independent sums. We omit detailed examples.

10.  Combinatorial  Sums  of  Matrices 

The method of exchangeable pairs can also be applied to many types of highly symmetric distributions. In this section, we study a class of combinatorial matrix statistics, which generalize the scalar statistics studied by Hoeffding [Hoe51].

10.1. Formulation. Consider a deterministic array $(A_{jk})_{j,k=1}^n$ of Hermitian matrices, and let $\pi$ be a uniformly random permutation on $\{1, \dots, n\}$. Define the random matrix

$$Y := \sum_{j=1}^n A_{j\pi(j)} \quad \text{whose mean} \quad \mathbb{E}\,Y = \frac{1}{n} \sum_{j,k=1}^n A_{jk}. \tag{10.1}$$

The combinatorial sum $Y$ is a natural candidate for an exchangeable pair analysis because the random permutation is highly symmetric. Before we describe how to construct a matrix Stein pair, let us mention a few problems that lead to a random matrix of the form $Y$.

Example 10.1 (Sampling without Replacement). Consider a finite collection $\mathscr{B} := \{B_1, \dots, B_n\}$ of deterministic Hermitian matrices. Suppose that we want to study a sum of $s$ matrices sampled randomly from $\mathscr{B}$ without replacement. We can express this type of series in the form

$$W := \sum_{j=1}^s B_{\pi(j)},$$

where $\pi$ is a random permutation on $\{1, \dots, n\}$. The matrix $W$ is therefore an example of a combinatorial sum.

Example 10.2 (A Randomized "Inner Product"). Consider two fixed sequences of complex matrices

$$B_1, \dots, B_n \in \mathbb{C}^{d_1 \times s} \quad \text{and} \quad C_1, \dots, C_n \in \mathbb{C}^{s \times d_2}.$$

We may form a permuted matrix "inner product" by arranging one sequence in random order, multiplying the elements of the two sequences together, and summing the terms. In other words, we are interested in the random matrix

$$Z := \sum_{j=1}^n B_j C_{\pi(j)}.$$

Then the random matrix $\mathscr{D}(Z)$ is a combinatorial sum of Hermitian matrices.

10.2. A Matrix Stein Pair. To study the combinatorial sum (10.1) of matrices using the method of exchangeable pairs, we first introduce the zero-mean random matrix

$$X := Y - \mathbb{E}\,Y.$$

To construct a matrix Stein pair $(X, X')$, we draw a pair $(J, K)$ of indices independently of $\pi$ and uniformly at random from $\{1, \dots, n\}^2$. Define a second random permutation $\pi' := \pi \circ (J, K)$ by composing $\pi$ with the transposition of the random indices $J$ and $K$. The pair $(\pi, \pi')$ is exchangeable, so the matrix

$$X' := \sum_{j=1}^n A_{j\pi'(j)} - \mathbb{E}\,Y$$

is an exchangeable counterpart to $X$.

To verify that $(X, X')$ is a matrix Stein pair, we calculate that

$$
\begin{aligned}
\mathbb{E}[X - X' \mid \pi] &= \mathbb{E}\big[A_{J\pi(J)} + A_{K\pi(K)} - A_{J\pi(K)} - A_{K\pi(J)} \mid \pi\big] \\
&= \frac{1}{n^2} \sum_{j,k=1}^n \big[A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)}\big] = \frac{2}{n}\,(Y - \mathbb{E}\,Y) = \frac{2}{n}\, X.
\end{aligned}
$$

The first identity holds because the random sums $X$ and $X'$ differ for only four choices of indices. We see that $(X, X')$ is a Stein pair with scale factor $\alpha = 2/n$.
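The Stein pair identity for the combinatorial sum can be verified exhaustively in a small case. The sketch below is an illustration only: it treats scalars as $1 \times 1$ Hermitian matrices, uses zero-based indices, and enumerates every permutation and every pair $(J, K)$. The array `a` and the function names are assumptions made for the example.

```python
import itertools
import random


def combinatorial_sum(a, perm):
    """Y = sum_j a[j][perm[j]] for a scalar array a (1 x 1 'matrices')."""
    return sum(a[j][perm[j]] for j in range(len(perm)))


def swapped(perm, J, K):
    """pi' = pi o (J K): exchange the images of positions J and K."""
    new = list(perm)
    new[J], new[K] = new[K], new[J]
    return tuple(new)


def conditional_mean_difference(a, perm):
    """E[X - X' | pi], averaging over all n^2 equally likely pairs (J, K)."""
    n = len(perm)
    y = combinatorial_sum(a, perm)
    total = sum(y - combinatorial_sum(a, swapped(perm, J, K))
                for J in range(n) for K in range(n))
    return total / n**2
```

For each permutation, `conditional_mean_difference(a, perm)` should agree with $(2/n)(Y - \mathbb{E}\,Y)$, where $\mathbb{E}\,Y$ is the sum of all entries of `a` divided by $n$.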

Turning to the conditional variance, we find that

$$\Delta_X(\pi) = \frac{n}{4}\, \mathbb{E}\big[(X - X')^2 \mid \pi\big] = \frac{1}{4n} \sum_{j,k=1}^n \big[A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)}\big]^2. \tag{10.2}$$

The structure of the conditional variance is somewhat different from previous examples, but we recognize that $\Delta_X$ is controlled when the matrices $A_{jk}$ are bounded.

10.3. Exponential Concentration for a Combinatorial Sum. We can apply our matrix concentration results to study the behavior of a combinatorial sum of matrices. As an example, let us present a Bernstein-type inequality. The argument is similar to the proof of Corollary 5.2, so we leave the details to Appendix C.


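Corollary 10.3 (Bernstein Inequality for a Combinatorial Sum of Matrices). Consider an array $(A_{jk})_{j,k=1}^n$ of deterministic matrices in $\mathbb{H}^d$ that satisfy

$$\sum_{j,k=1}^n A_{jk} = 0 \quad \text{and} \quad \|A_{jk}\| \le R \quad \text{for each pair } (j, k) \text{ of indices}.$$

Define the random matrix $X := \sum_{j=1}^n A_{j\pi(j)}$, where $\pi$ is a uniformly random permutation on $\{1, \dots, n\}$. Then, for all $t \ge 0$,

$$\mathbb{P}\{\lambda_{\max}(X) \ge t\} \le d \cdot \exp\left\{\frac{-t^2}{12\sigma^2 + 4\sqrt{2}\,Rt}\right\} \quad \text{where} \quad \sigma^2 := \frac{1}{n} \Big\|\sum_{j,k=1}^n A_{jk}^2\Big\|.$$

Furthermore,

$$\mathbb{E}\,\lambda_{\max}(X) \le \sigma\sqrt{12 \log d} + 2\sqrt{2}\,R \log d.$$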


11.  Self-Reproducing  Matrix  Functions 


The method of exchangeable pairs can also be used to analyze nonlinear matrix-valued functions of random variables. In this section, we explain how to analyze matrix functions that satisfy a self-reproducing property.

11.1. Example: Matrix Second-Order Rademacher Chaos. We begin with an example that shows how the self-reproducing property might arise. Consider a quadratic form that takes on random matrix values:

$$H(\varepsilon) := \sum\nolimits_k \sum\nolimits_{j < k} \varepsilon_j \varepsilon_k A_{jk}. \tag{11.1}$$

In this expression, $\varepsilon$ is a finite vector of independent Rademacher random variables. The array $(A_{jk})_{j,k \ge 1}$ consists of deterministic Hermitian matrices, and we assume that $A_{jk} = A_{kj}$.

Observe that the summands in $H(\varepsilon)$ are dependent and that they need not satisfy the conditional zero-mean property (9.1). Nevertheless, $H(\varepsilon)$ does satisfy a fruitful self-reproducing property:

$$
\begin{aligned}
\sum\nolimits_k \big(H(\varepsilon) - \mathbb{E}[H(\varepsilon) \mid (\varepsilon_j)_{j \ne k}]\big) &= \sum\nolimits_k \sum\nolimits_{j \ne k} \varepsilon_j \big(\varepsilon_k - \mathbb{E}[\varepsilon_k]\big) A_{jk} \\
&= \sum\nolimits_k \sum\nolimits_{j \ne k} \varepsilon_j \varepsilon_k A_{jk} = 2\,H(\varepsilon).
\end{aligned}
$$

We have applied the pull-through property of conditional expectation, the assumption that the Rademacher variables are independent, and the fact that $A_{jk} = A_{kj}$. As we will see, this type of self-reproducing property can be used to construct a matrix Stein pair.
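The self-reproducing identity can be confirmed by brute force in a small case. The sketch below is an illustration only: it uses scalar coefficients in place of Hermitian matrices, zero-based indices, and an arbitrary symmetric test array. It computes each conditional expectation by averaging over the two values of the resampled sign. The function names are our own.

```python
import itertools


def chaos(eps, a):
    """H(eps) = sum over j < k of eps_j * eps_k * a[j][k], with a symmetric."""
    n = len(eps)
    return sum(eps[j] * eps[k] * a[j][k] for k in range(n) for j in range(k))


def conditional_mean(eps, a, k):
    """E[H(eps) | (eps_j)_{j != k}]: average over the two signs of eps_k."""
    plus = list(eps)
    plus[k] = 1
    minus = list(eps)
    minus[k] = -1
    return 0.5 * (chaos(plus, a) + chaos(minus, a))


def self_reproducing_sum(eps, a):
    """Sum over k of H(eps) - E[H(eps) | (eps_j)_{j != k}]."""
    h = chaos(eps, a)
    return sum(h - conditional_mean(eps, a, k) for k in range(len(eps)))
```

For every sign vector, `self_reproducing_sum(eps, a)` should equal `2 * chaos(eps, a)`, which matches the value $s = 2$ of the self-reproducing parameter in this example.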

A random matrix of the form (11.1) is called a second-order Rademacher chaos. This class of random matrices arises in a variety of situations, including randomized linear algebra [CD11a], compressed sensing [Rau10, Sec. 9], and chance-constrained optimization [CSW11]. Indeed, concentration inequalities for the matrix-valued Rademacher chaos have a wide range of potential applications.

11.2. Formulation and Matrix Stein Pair. In this section, we describe a more general version of the self-reproducing property. Suppose that $z := (Z_1, \dots, Z_n)$ is a random vector taking values in a Polish space $\mathcal{Z}$. First, we construct an exchangeable counterpart

$$z' := (Z_1, \dots, Z_{K-1}, Z_K', Z_{K+1}, \dots, Z_n) \tag{11.2}$$

where $Z_k$ and $Z_k'$ are conditionally i.i.d. given $(Z_j)_{j \ne k}$ and $K$ is an independent coordinate drawn uniformly at random from $\{1, \dots, n\}$.

Next, let $H : \mathcal{Z} \to \mathbb{H}^d$ be a bounded measurable function. Assume that $H(z)$ satisfies an abstract self-reproducing property: for a parameter $s > 0$,

$$\sum_{k=1}^n \big(H(z) - \mathbb{E}[H(Z_1, \dots, Z_k', \dots, Z_n) \mid z]\big) = s \cdot \big(H(z) - \mathbb{E}\,H(z)\big) \quad \text{almost surely}. \tag{11.3}$$

Under this assumption, we can easily check that the random matrices

$$X := H(z) - \mathbb{E}\,H(z) \quad \text{and} \quad X' := H(z') - \mathbb{E}\,H(z)$$

form a matrix Stein pair. Indeed,

$$\mathbb{E}[X - X' \mid z] = \mathbb{E}[H(z) - H(z') \mid z] = \frac{s}{n}\,\big(H(z) - \mathbb{E}\,H(z)\big) = \frac{s}{n}\, X.$$

We determine that $(X, X')$ is a matrix Stein pair with scaling factor $\alpha = s/n$.

Finally, we compute the conditional variance:

$$\Delta_X(z) = \frac{n}{2s}\, \mathbb{E}\big[(H(z) - H(z'))^2 \mid z\big] = \frac{1}{2s} \sum_{k=1}^n \mathbb{E}\big[(H(z) - H(Z_1, \dots, Z_k', \dots, Z_n))^2 \mid z\big].$$

We discover that the conditional variance is small when each coordinate of $H$ has controlled variance. In this case, the method of exchangeable pairs provides good concentration inequalities for the random matrix $X$.

11.3. Matrix Bounded Differences Inequality. As an example of this framework, we can develop a bounded differences inequality for random matrices by appealing to Theorem 4.1.


Corollary 11.1 (Matrix Bounded Differences). Let $z := (Z_1, \dots, Z_n)$ be a random vector taking values in a Polish space $\mathcal{Z}$, and let $z'$ be an exchangeable counterpart, constructed as in (11.2). Suppose that $H : \mathcal{Z}^n \to \mathbb{H}^d$ is a bounded function that satisfies the self-reproducing property

$$\sum_{k=1}^n \big( H(z) - \mathbb{E}[ H(Z_1, \dots, Z_k', \dots, Z_n) \mid z ] \big) = s \cdot \big( H(z) - \mathbb{E}\, H(z) \big) \quad \text{almost surely}$$

for a parameter $s > 0$ as well as the bounded differences condition

$$\mathbb{E}\big[ (H(z) - H(Z_1, \dots, Z_k', \dots, Z_n))^2 \mid z \big] \preccurlyeq A_k^2 \quad \text{almost surely for each index } k, \tag{11.4}$$

where $A_k$ is a deterministic matrix in $\mathbb{H}^d$. Then, for all $t \ge 0$,

$$\mathbb{P}\{ \lambda_{\max}(H(z) - \mathbb{E}\, H(z)) \ge t \} \le d \cdot e^{-s t^2 / L} \quad \text{where} \quad L := \Big\| \sum_{k=1}^n A_k^2 \Big\|.$$

Furthermore,

$$\mathbb{E}\, \lambda_{\max}(H(z) - \mathbb{E}\, H(z)) \le \sqrt{\frac{L \log d}{s}}.$$

In the scalar setting, Corollary 11.1 reduces to a version of McDiarmid's bounded difference inequality [McD89]. The result also complements the matrix bounded difference inequality of [Tro11b, Cor. 7.5], which requires independent input variables but makes no self-reproducing assumption.


Proof. Since $H(z)$ is self-reproducing, we may construct a matrix Stein pair $(X, X')$ with scale factor $\alpha = s/n$ as in Section 11. According to (11.2), the conditional variance of the pair satisfies

$$\Delta_X = \frac{1}{2s} \sum_{k=1}^n \mathbb{E}\big[ (H(z) - H(Z_1, \dots, Z_k', \dots, Z_n))^2 \mid z \big] \preccurlyeq \frac{1}{2s} \sum_{k=1}^n A_k^2 \preccurlyeq \frac{L}{2s} \cdot I.$$

We have used the bounded differences condition (11.4) and the definition of the bound $L$. To complete the proof, we apply the concentration result, Theorem 4.1, with the parameters $c = 0$ and $v = L/2s$. □
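The parameter substitution at the end of the proof can be traced numerically. The sketch below assumes that Theorem 4.1 supplies the upper tail bound $d \exp(-t^2/(2v + 2ct))$ and the expectation bound $\sqrt{2v \log d} + c \log d$ whenever $\Delta_X \preccurlyeq cX + vI$; that form is recalled from the earlier statement of the theorem rather than from this section, and the function names and sample values are illustrative.

```python
import math


def stein_tail_bound(t, c, v, d):
    """Assumed form of the Theorem 4.1 upper tail: d * exp(-t^2 / (2v + 2ct))."""
    return d * math.exp(-t * t / (2.0 * v + 2.0 * c * t))


def stein_mean_bound(c, v, d):
    """Assumed form of the Theorem 4.1 mean bound: sqrt(2 v log d) + c log d."""
    return math.sqrt(2.0 * v * math.log(d)) + c * math.log(d)


def bounded_differences_tail(t, L, s, d):
    """Tail bound of Corollary 11.1: d * exp(-s t^2 / L)."""
    return d * math.exp(-s * t * t / L)


def bounded_differences_mean(L, s, d):
    """Expectation bound of Corollary 11.1: sqrt(L log d / s)."""
    return math.sqrt(L * math.log(d) / s)


# Hypothetical values of the variance proxy L, the parameter s, and the dimension d.
L, s, d = 3.0, 2.0, 5
print(stein_tail_bound(1.0, 0.0, L / (2.0 * s), d), bounded_differences_tail(1.0, L, s, d))
```

With $c = 0$ and $v = L/2s$, the two pairs of functions agree, which is the content of the last sentence of the proof.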


12.  Extensions  and  Future  Work 

Although  the  examples  we  treat  in  this  work  all  involve  matrix  Stein  pairs,  we  are  aware  of 
problems  that  require  a  larger  class  of  exchangeable  pairs.  The  following  definition  represents  a 
significant  extension  of  the  matrix  Stein  pair  formalism. 

Definition 12.1 (Generalized Matrix Stein Pair). Let $g : \mathbb{R} \to \mathbb{R}$ be a weakly increasing function. Let $(Z, Z')$ be an exchangeable pair of random variables taking values in a Polish space $\mathcal{Z}$, and let $\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function. Define the random Hermitian matrices

$$X := \Psi(Z) \quad \text{and} \quad X' := \Psi(Z').$$

We say that $(X, X')$ is a generalized matrix Stein pair if there is a constant $\alpha \in (0, 1]$ for which

$$\mathbb{E}[ g(X) - g(X') \mid Z ] = \alpha X \quad \text{almost surely.}$$

When  the  function  g  is  linear,  a  generalized  matrix  Stein  pair  reduces  to  a  matrix  Stein  pair.  The 
conditional  variance  of  the  generalized  pair  is  defined  as 

$$\Delta_X := \Delta_X(Z) := \frac{1}{2\alpha} \cdot \operatorname{Re}\, \mathbb{E}\big[ (g(X) - g(X')) \cdot (X - X') \mid Z \big],$$

where $\operatorname{Re}(A) := \frac{1}{2}(A + A^*)$ refers to the Hermitian part of a square matrix.

We have chosen to focus on the simpler definition of a Stein pair because the presentation is more transparent. Nevertheless, the results in this paper all extend readily to generalized matrix Stein pairs with little extra work. We leave applications of generalized matrix Stein pairs for future research.

Appendix A. Proof of Theorem 5.1: Refined Exponential Concentration

The  proof  of  Theorem  5.1  parallels  the  argument  in  Theorem  4.1,  but  it  differs  at  an  important 
point.  In  the  earlier  result,  we  used  an  almost-sure  bound  on  the  conditional  variance  to  control 
the derivative of the trace mgf. This time, we use entropy inequalities to introduce finer information about the behavior of the conditional variance. The proof is essentially a matrix version of
Chatterjee’s  argument  [Cha08,  Thm.  3.13]. 

Let $(X, X')$ be a matrix Stein pair. Consider the normalized trace mgf

$$m(\theta) := \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X} \quad \text{for } \theta \ge 0. \tag{A.1}$$

Our main object is to establish the following inequality for the trace mgf:

$$\log m(\theta) \le \frac{r(\psi)\, \theta^2}{2(1 - \theta^2/\psi)} \quad \text{for } 0 \le \theta < \sqrt{\psi} \text{ and } \psi > 0. \tag{A.2}$$

The function $r(\psi)$ is defined in (5.1). We establish (A.2) in Section A.1 et seq. Afterward, in Section A.5, we invoke the matrix Laplace transform bound to complete the proof of Theorem 5.1.

A.1. The Derivative of the Trace Mgf. The first steps of the argument are the same as in the proof of Theorem 4.1. Since $X$ is almost surely bounded, we need not worry about regularity conditions. The derivative of the trace mgf satisfies

$$m'(\theta) = \mathbb{E}\, \bar{\operatorname{tr}}\big[ X e^{\theta X} \big] \quad \text{for } \theta \in \mathbb{R}. \tag{A.3}$$

Lemma 3.7 provides a bound for the derivative in terms of the conditional variance:

$$m'(\theta) \le \theta \cdot \mathbb{E}\, \bar{\operatorname{tr}}\big[ \Delta_X\, e^{\theta X} \big] \quad \text{for } \theta \ge 0. \tag{A.4}$$

In the proof of Lemma 4.3, we applied an almost-sure bound for the conditional variance to control the derivative of the mgf. This time, we incorporate information about the typical size of $\Delta_X$ by developing a bound in terms of the function $r(\psi)$.

A.2. Entropy for Random Matrices and Duality. Let us introduce an entropy function for random matrices.

Definition A.1 (Entropy for Random Matrices). Let $W$ be a random matrix in $\mathbb{H}^d_+$ subject to the normalization $\mathbb{E}\, \bar{\operatorname{tr}}\, W = 1$. The (negative) matrix entropy is defined as

$$\operatorname{ent}(W) := \mathbb{E}\, \bar{\operatorname{tr}}(W \log W). \tag{A.5}$$

We enforce the convention that $0 \log 0 = 0$.

The matrix entropy is relevant to our discussion because its Fenchel-Legendre conjugate is the cumulant generating function. The Young inequality for matrix entropy provides one way to formulate this duality relationship.

Proposition A.2 (Young Inequality for Matrix Entropy). Suppose that $V$ is a random matrix in $\mathbb{H}^d$ that is almost surely bounded in norm, and suppose that $W$ is a random matrix in $\mathbb{H}^d_+$ subject to the normalization $\mathbb{E}\, \bar{\operatorname{tr}}\, W = 1$. Then

$$\mathbb{E}\, \bar{\operatorname{tr}}(VW) \le \log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{V} + \operatorname{ent}(W).$$

Proposition A.2 follows from an easy variant of the argument in [Car10, Thm. 2.13].


A.3. A Refined Differential Inequality for the Trace Mgf. We intend to apply the Young inequality for matrix entropy to decouple the product of random matrices in (A.4). First, we must rescale the exponential in (A.4) so its expected trace equals one:

$$W(\theta) := \frac{1}{\mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X}} \cdot e^{\theta X} = \frac{1}{m(\theta)} \cdot e^{\theta X}. \tag{A.6}$$

For each $\psi > 0$, we can rewrite (A.4) as

$$m'(\theta) \le \frac{\theta\, m(\theta)}{\psi} \cdot \mathbb{E}\, \bar{\operatorname{tr}}\big[ \psi \Delta_X \cdot W(\theta) \big].$$

The Young inequality for matrix entropy, Proposition A.2, implies that

$$m'(\theta) \le \frac{\theta\, m(\theta)}{\psi} \Big[ \log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\psi \Delta_X} + \operatorname{ent}(W(\theta)) \Big]. \tag{A.7}$$

The first term in the bracket is precisely $\psi\, r(\psi)$. Let us examine the second term more closely.

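To control the matrix entropy of $W(\theta)$, we need to bound its logarithm. Referring back to the definition (A.6), we see that

$$\log W(\theta) = \theta X - \big( \log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\theta X} \big) \cdot I \preccurlyeq \theta X - \big( \log \bar{\operatorname{tr}}\, e^{\theta\, \mathbb{E} X} \big) \cdot I = \theta X. \tag{A.8}$$

The second relation depends on Jensen's inequality and the fact that the trace exponential is convex [Pet94, Sec. 2]. The third relation relies on the property that $\mathbb{E} X = 0$. Since the matrix $W(\theta)$ is positive, we can substitute the semidefinite bound (A.8) into the definition (A.5) of the matrix entropy:

$$\operatorname{ent}(W(\theta)) = \mathbb{E}\, \bar{\operatorname{tr}}\big[ W(\theta) \cdot \log W(\theta) \big] \le \theta \cdot \mathbb{E}\, \bar{\operatorname{tr}}\big[ W(\theta) \cdot X \big] = \frac{\theta}{m(\theta)} \cdot \mathbb{E}\, \bar{\operatorname{tr}}\big[ X e^{\theta X} \big].$$

We have reintroduced the definition (A.6) of $W(\theta)$ in the last relation. Identify the derivative (A.3) of the trace mgf to reach

$$\operatorname{ent}(W(\theta)) \le \frac{\theta\, m'(\theta)}{m(\theta)}. \tag{A.9}$$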

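To establish a differential inequality, substitute the definition (5.1) of $r(\psi)$ and the bound (A.9) into the estimate (A.7) to discover that

$$m'(\theta) \le \frac{\theta\, m(\theta)}{\psi} \Big[ \psi\, r(\psi) + \frac{\theta\, m'(\theta)}{m(\theta)} \Big] = r(\psi)\, \theta \cdot m(\theta) + \frac{\theta^2}{\psi} \cdot m'(\theta).$$

Rearrange this formula to isolate the log-derivative $m'(\theta)/m(\theta)$ of the trace mgf. We conclude that

$$\frac{d}{d\theta} \log m(\theta) \le \frac{r(\psi)\, \theta}{1 - \theta^2/\psi} \quad \text{for } 0 \le \theta < \sqrt{\psi}. \tag{A.10}$$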

A.4. Solving the Differential Inequality. To integrate (A.10), recall that $\log m(0) = 0$, and invoke the fundamental theorem of calculus to see that

$$\log m(\theta) \le \int_0^\theta \frac{r(\psi)\, s}{1 - s^2/\psi}\, ds \le \int_0^\theta \frac{r(\psi)\, s}{1 - \theta^2/\psi}\, ds = \frac{r(\psi)\, \theta^2}{2(1 - \theta^2/\psi)}.$$

This calculation is valid whenever $0 \le \theta < \sqrt{\psi}$. This is the claim (A.2).
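The integration step can be checked numerically. The first integral has the closed form $-\tfrac{1}{2} r(\psi)\, \psi \log(1 - \theta^2/\psi)$, and the sketch below compares it with the right-hand side of (A.2) at a few sample points. The function names and the sample values of $r(\psi)$ and $\psi$ are assumptions chosen for illustration.

```python
import math


def exact_integral(theta, r, psi):
    """Closed form of the integral of r*s / (1 - s^2/psi) over s in [0, theta]."""
    return -0.5 * r * psi * math.log(1.0 - theta * theta / psi)


def mgf_bound(theta, r, psi):
    """Right-hand side of (A.2): r * theta^2 / (2 * (1 - theta^2/psi))."""
    return r * theta * theta / (2.0 * (1.0 - theta * theta / psi))


# Hypothetical values of r(psi) and psi; theta ranges over [0, sqrt(psi)).
r, psi = 2.0, 0.5
for fraction in (0.0, 0.25, 0.5, 0.9):
    theta = fraction * math.sqrt(psi)
    print(theta, exact_integral(theta, r, psi), mgf_bound(theta, r, psi))
```

The comparison reflects the elementary inequality $-\log(1 - x) \le x/(1 - x)$ for $0 \le x < 1$, applied with $x = \theta^2/\psi$.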


A.5. The Matrix Laplace Transform Argument. With the trace mgf bound (A.2) at hand, we can complete the proof of Theorem 5.1. Proposition 3.3, the matrix Laplace transform method, yields the estimate

$$\mathbb{P}\{ \lambda_{\max}(X) \ge t \} \le d \cdot \inf_{0 < \theta < \sqrt{\psi}} \exp\Big\{ -\theta t + \frac{r(\psi)}{2} \cdot \frac{\theta^2}{1 - \theta^2/\psi} \Big\} \le d \cdot e^{-t\, \theta_\star(t)/2}, \tag{A.11}$$

where we define $\theta_\star(t)$ implicitly as the positive root of the quadratic equation

$$\frac{r(\psi)\, \theta}{1 - \theta^2/\psi} = t.$$

Solve the quadratic equation to obtain the explicit formula

$$\theta_\star(t) = \frac{\psi\, r(\psi)}{2t} \Big[ \sqrt{1 + \frac{4 t^2}{\psi\, r(\psi)^2}} - 1 \Big]. \tag{A.12}$$

The numerical fact $\sqrt{1 + a} \le 1 + \sqrt{a}$, valid for $a \ge 0$, allows us to verify that $\theta_\star(t) \le \sqrt{\psi}$. We can obtain a lower bound

$$\theta_\star(t) \ge \frac{t}{r(\psi) + t/\sqrt{\psi}} \tag{A.13}$$

from the inequality $\sqrt{1 + a} - 1 \ge a/(2 + \sqrt{a})$, which holds for $a \ge 0$. Indeed,

$$a = (\sqrt{1 + a} - 1)(\sqrt{1 + a} + 1) \le (\sqrt{1 + a} - 1)(2 + \sqrt{a}).$$
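The root formula (A.12), the constraint $\theta_\star(t) \le \sqrt{\psi}$, and the lower bound (A.13) can all be checked numerically. The sketch below is an illustration under assumed sample values of $t$, $r(\psi)$, and $\psi$; the function names are hypothetical.

```python
import math


def theta_star(t, r, psi):
    """Explicit root (A.12) of the equation r*theta / (1 - theta^2/psi) = t."""
    a = 4.0 * t * t / (psi * r * r)
    return psi * r / (2.0 * t) * (math.sqrt(1.0 + a) - 1.0)


def theta_lower(t, r, psi):
    """Lower bound (A.13): t / (r + t / sqrt(psi))."""
    return t / (r + t / math.sqrt(psi))


def residual(t, r, psi):
    """Difference between the two sides of the defining equation at theta_star."""
    th = theta_star(t, r, psi)
    return r * th / (1.0 - th * th / psi) - t


# Hypothetical values of r(psi) and psi.
r, psi = 2.0, 0.5
for t in (0.1, 1.5, 10.0):
    print(t, theta_star(t, r, psi), theta_lower(t, r, psi), residual(t, r, psi))
```

In each case the residual is zero up to rounding, and $\theta_\star(t)$ lies between the lower bound (A.13) and $\sqrt{\psi}$.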

Introduce the estimate (A.13) into the probability inequality (A.11) to complete the proof of the tail bound (5.2).

To establish the inequality (5.3) for the expectation of the maximum eigenvalue, we can apply Proposition 3.3 and the trace mgf bound (A.2) a second time. Indeed,

$$\mathbb{E}\, \lambda_{\max}(X) \le \inf_{0 < \theta < \sqrt{\psi}} \Big[ \frac{\log d}{\theta} + \frac{r(\psi)}{2} \cdot \frac{\theta}{1 - \theta^2/\psi} \Big] \le \inf_{t > 0} \Big[ \frac{\log d}{\theta_\star(t)} + \frac{t}{2} \Big],$$

where $\theta_\star(t)$ is the function defined in (A.12). Incorporate the lower bound (A.13) to reach

$$\mathbb{E}\, \lambda_{\max}(X) \le \inf_{t > 0} \Big[ \frac{r(\psi) \log d}{t} + \frac{t}{2} + \frac{\log d}{\sqrt{\psi}} \Big] = \sqrt{2\, r(\psi) \log d} + \frac{\log d}{\sqrt{\psi}}.$$

This observation completes the proof of Theorem 5.1.
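The final minimization over $t$ is elementary: the infimum of $r(\psi) \log d / t + t/2$ is attained at $t = \sqrt{2\, r(\psi) \log d}$. The sketch below confirms this on a grid, using assumed sample values for $r(\psi)$, $\psi$, and $d$; the function names are hypothetical.

```python
import math


def objective(t, r, psi, d):
    """Expression inside the infimum: r log d / t + t / 2 + log d / sqrt(psi)."""
    return r * math.log(d) / t + 0.5 * t + math.log(d) / math.sqrt(psi)


def refined_mean_bound(r, psi, d):
    """Closed-form value of the infimum: sqrt(2 r log d) + log d / sqrt(psi)."""
    return math.sqrt(2.0 * r * math.log(d)) + math.log(d) / math.sqrt(psi)


# Hypothetical values of r(psi), psi, and the dimension d.
r, psi, d = 2.0, 0.5, 10
t_opt = math.sqrt(2.0 * r * math.log(d))
print(objective(t_opt, r, psi, d), refined_mean_bound(r, psi, d))
```

Evaluating `objective` at any other positive `t` returns a value no smaller than `refined_mean_bound(r, psi, d)`.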


Appendix  B.  Proof  of  Theorem  7.5:  Matrix  Rosenthal  Inequality 

The  proof  of  the  matrix  Rosenthal  inequality  takes  place  in  two  steps.  First,  we  verify  that  the 
bound  (7.3)  holds  for  psd  random  matrices.  Then,  we  use  this  result  to  provide  a  short  proof  of 
the  bound  (7.4)  for  Hermitian  random  matrices.  Before  we  start,  let  us  remind  the  reader  that  the 
$L_p$ norm of a scalar random variable $Z$ is given by the expression $(\mathbb{E}\, |Z|^p)^{1/p}$ for each $p \ge 1$.

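B.1. A Sum of Random Psd Matrices. We begin with the moment bound (7.3) for an independent sum of random psd matrices. Introduce the quantity of interest

$$E^2 := \Big( \mathbb{E}\, \Big\| \sum\nolimits_k P_k \Big\|_{2p}^{2p} \Big)^{1/2p}.$$

We may invoke the triangle inequality for the $L_{2p}$ norm to obtain

$$E^2 \le \Big( \mathbb{E}\, \Big\| \sum\nolimits_k (P_k - \mathbb{E} P_k) \Big\|_{2p}^{2p} \Big)^{1/2p} + \Big\| \sum\nolimits_k \mathbb{E} P_k \Big\|_{2p} =: \big( \mathbb{E}\, \| X \|_{2p}^{2p} \big)^{1/2p} + \mu.$$

We can apply the matrix BDG inequality to control this expectation, which yields an algebraic inequality between $E^2$ and $E$. We solve this inequality to bound $E^2$.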

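The series $X$ consists of centered, independent random matrices, so we can use the Stein pair described in Section 2.4. According to (2.6), the conditional variance $\Delta_X$ takes the form

$$\begin{aligned}
\Delta_X &= \frac{1}{2} \sum\nolimits_k \big[ (P_k - \mathbb{E} P_k)^2 + \mathbb{E}(P_k - \mathbb{E} P_k)^2 \big] \\
&\preccurlyeq \frac{1}{2} \sum\nolimits_k \big[ 2 P_k^2 + 2 (\mathbb{E} P_k)^2 + \mathbb{E} P_k^2 - (\mathbb{E} P_k)^2 \big] \\
&\preccurlyeq \sum\nolimits_k \big( P_k^2 + \mathbb{E} P_k^2 \big).
\end{aligned}$$

The first inequality follows from the operator convexity (1.1) of the square function; we have computed the second expectation exactly. The last bound follows from the operator Jensen inequality (1.2). Now, the matrix BDG inequality yields

$$\begin{aligned}
E^2 &\le \sqrt{2p - 1} \cdot \big( \mathbb{E}\, \| \Delta_X \|_p^p \big)^{1/2p} + \mu \\
&\le \sqrt{2p - 1} \cdot \Big( \mathbb{E}\, \Big\| \sum\nolimits_k \big( P_k^2 + \mathbb{E} P_k^2 \big) \Big\|_p^p \Big)^{1/2p} + \mu \\
&\le \sqrt{4p - 2} \cdot \Big( \mathbb{E}\, \Big\| \sum\nolimits_k P_k^2 \Big\|_p^p \Big)^{1/2p} + \mu.
\end{aligned}$$

The third line follows from the triangle inequality for the $L_p$ norm and Jensen's inequality.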

Next,  we  search  for  a  copy  of  E2  inside  this  expectation.  To  accomplish  this  goal,  we  want  to 
draw a factor $P_k$ off of each term in the sum. The following result of Pisier and Xu [PX97, Lem. 2.6] has the form we desire.

Proposition B.1 (A Matrix Schwarz-Type Inequality). Consider a finite sequence $(A_k)_{k \ge 1}$ of deterministic psd matrices. For each $p \ge 1$,

$$\Big\| \sum\nolimits_k A_k^2 \Big\|_p \le \Big( \sum\nolimits_k \| A_k \|_{2p}^{2p} \Big)^{1/2p} \Big\| \sum\nolimits_k A_k \Big\|_{2p}.$$
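Apply the matrix Schwarz-type inequality, Proposition B.1, to reach

$$\begin{aligned}
E^2 &\le \sqrt{4p - 2} \cdot \Big[ \mathbb{E}\, \Big( \sum\nolimits_k \| P_k \|_{2p}^{2p} \Big)^{1/2} \Big\| \sum\nolimits_k P_k \Big\|_{2p}^{p} \Big]^{1/2p} + \mu \\
&\le \sqrt{4p - 2} \cdot \Big( \sum\nolimits_k \mathbb{E}\, \| P_k \|_{2p}^{2p} \Big)^{1/4p} \Big( \mathbb{E}\, \Big\| \sum\nolimits_k P_k \Big\|_{2p}^{2p} \Big)^{1/4p} + \mu.
\end{aligned}$$

The second bound is the Cauchy-Schwarz inequality for expectation. The resulting estimate takes the form $E^2 \le cE + \mu$. Solutions of this quadratic inequality must satisfy $E \le c + \sqrt{\mu}$. We reach

$$E \le \sqrt{4p - 2} \cdot \Big( \sum\nolimits_k \mathbb{E}\, \| P_k \|_{2p}^{2p} \Big)^{1/4p} + \Big\| \sum\nolimits_k \mathbb{E} P_k \Big\|_{2p}^{1/2}.$$

Square this expression to complete the proof of (7.3).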
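The step from $E^2 \le cE + \mu$ to $E \le c + \sqrt{\mu}$ uses the quadratic formula together with the subadditivity of the square root: any solution satisfies $E \le \tfrac{1}{2}(c + \sqrt{c^2 + 4\mu}) \le c + \sqrt{\mu}$. The sketch below checks this for a few sample pairs; the function names and the sample values of $c$ and $\mu$ are illustrative assumptions.

```python
import math


def largest_solution(c, mu):
    """Largest E >= 0 with E^2 <= c*E + mu, i.e. the positive root of E^2 - c*E - mu."""
    return 0.5 * (c + math.sqrt(c * c + 4.0 * mu))


def simple_bound(c, mu):
    """The relaxed bound c + sqrt(mu) used in the proof."""
    return c + math.sqrt(mu)


# Hypothetical nonnegative values of c and mu.
for c, mu in ((0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (0.5, 10.0)):
    print(c, mu, largest_solution(c, mu), simple_bound(c, mu))
```

The two quantities coincide when either $c = 0$ or $\mu = 0$, so the relaxation loses nothing in those extreme cases.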


B.2. A Sum of Centered, Random Hermitian Matrices. We are now prepared to establish the bound (7.4) for a sum of centered, independent, random Hermitian matrices. Define the random matrix $X := \sum_k Y_k$. We may use the matrix Stein pair described in Section 2.4. According to (2.6), the conditional variance $\Delta_X$ takes the form

$$\Delta_X = \frac{1}{2} \sum\nolimits_k \big( Y_k^2 + \mathbb{E} Y_k^2 \big).$$

The matrix BDG inequality, Theorem 7.1, yields

$$\begin{aligned}
\big( \mathbb{E}\, \| X \|_{4p}^{4p} \big)^{1/4p} &\le \sqrt{4p - 1} \cdot \big( \mathbb{E}\, \| \Delta_X \|_{2p}^{2p} \big)^{1/4p} \\
&= \sqrt{4p - 1} \cdot \Big( \mathbb{E}\, \Big\| \frac{1}{2} \sum\nolimits_k \big( Y_k^2 + \mathbb{E} Y_k^2 \big) \Big\|_{2p}^{2p} \Big)^{1/4p} \\
&\le \sqrt{4p - 1} \cdot \Big( \mathbb{E}\, \Big\| \sum\nolimits_k Y_k^2 \Big\|_{2p}^{2p} \Big)^{1/4p}.
\end{aligned}$$

The third line follows from the triangle inequality for the $L_{2p}$ norm and Jensen's inequality. To bound the remaining expectation, we simply note that the sum consists of independent, random psd matrices. We complete the proof by invoking the matrix Rosenthal inequality (7.3) and simplifying.


Appendix  C.  Proof  of  Theorem  10.3:  A  Combinatorial  Sum  of  Matrices 

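Consider the matrix Stein pair $(X, X')$ constructed in Section 10.2. The expression (10.2) and the operator convexity (1.1) of the matrix square allow us to bound the conditional variance as follows.

$$\begin{aligned}
\Delta_X(\pi) &= \frac{1}{4n} \sum_{j,k=1}^n \big[ A_{j\pi(j)} + A_{k\pi(k)} - A_{j\pi(k)} - A_{k\pi(j)} \big]^2 \\
&\preccurlyeq \frac{1}{n} \sum_{j,k=1}^n \big[ A_{j\pi(j)}^2 + A_{k\pi(k)}^2 + A_{j\pi(k)}^2 + A_{k\pi(j)}^2 \big] \\
&= 2 \sum_{j=1}^n A_{j\pi(j)}^2 + \frac{2}{n} \sum_{j,k=1}^n A_{jk}^2 = W + 4S
\end{aligned}$$

where

$$W := 2 \Big( \sum_{j=1}^n A_{j\pi(j)}^2 \Big) - 2S \quad \text{and} \quad S := \frac{1}{n} \sum_{j,k=1}^n A_{jk}^2.$$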

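Substitute the bound for $\Delta_X(\pi)$ into the definition (5.1) of $r(\psi)$ to see that

$$r(\psi) := \frac{1}{\psi} \log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\psi \Delta_X(\pi)} \le \frac{1}{\psi} \log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\psi (W + 4S)} \le 4\sigma^2 + \frac{1}{\psi} \log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\psi W}. \tag{C.1}$$

The inequalities follow from the monotonicity of the trace exponential [Pet94, Sec. 2] and the fact that $\sigma^2 = \| S \|$. Therefore, it suffices to bound the trace mgf of $W$.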

Our approach is to construct a matrix Stein pair for $W$ and to argue that the associated conditional variance $\Delta_W(\pi)$ satisfies a semidefinite bound. We may then exploit the trace mgf bounds from Lemma 4.3. Observe that $W$ and $X$ take the same form: both have mean zero and share the structure of a combinatorial sum. Therefore, we can study the behavior of $W$ using the matrix Stein pair from Section 10.2. Adapting (10.2), we see that the conditional variance of $W$ satisfies

$$\begin{aligned}
\Delta_W(\pi) &= \frac{1}{n} \sum_{j,k=1}^n \big[ A_{j\pi(j)}^2 + A_{k\pi(k)}^2 - A_{j\pi(k)}^2 - A_{k\pi(j)}^2 \big]^2 \\
&\preccurlyeq \frac{4}{n} \sum_{j,k=1}^n \big[ A_{j\pi(j)}^4 + A_{k\pi(k)}^4 + A_{j\pi(k)}^4 + A_{k\pi(j)}^4 \big] \\
&\preccurlyeq \frac{4R^2}{n} \sum_{j,k=1}^n \big[ A_{j\pi(j)}^2 + A_{k\pi(k)}^2 + A_{j\pi(k)}^2 + A_{k\pi(j)}^2 \big].
\end{aligned}$$

In the first line, the centering terms in $W$ cancel each other out. Then we apply the operator convexity (1.1) of the matrix square and the bound $A_{jk}^4 \preccurlyeq R^2 A_{jk}^2$. Finally, identify $W$ and $S$ to reach

$$\Delta_W(\pi) \preccurlyeq 4R^2 (W + 4S) \preccurlyeq 4R^2 \cdot W + 16 R^2 \sigma^2 \cdot I. \tag{C.2}$$

The matrix inequality (C.2) gives us access to established trace mgf bounds. Indeed,

$$\log \mathbb{E}\, \bar{\operatorname{tr}}\, e^{\psi W} \le \frac{8 R^2 \sigma^2 \psi^2}{1 - 4R^2 \psi}$$

as a consequence of Lemma 4.3 with parameters $c = 4R^2$ and $v = 16 R^2 \sigma^2$.

At last, we substitute the latter bound into (C.1) to discover that

$$r(\psi) \le 4\sigma^2 + \frac{8 R^2 \sigma^2 \psi}{1 - 4R^2 \psi}.$$

In particular, setting $\psi = (8R^2)^{-1}$, we find that $r(\psi) \le 6\sigma^2$. Apply Theorem 5.1 to wrap up.
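The closing arithmetic can be traced with a short computation. The sketch below evaluates the bound on $r(\psi)$ at $\psi = (8R^2)^{-1}$ and substitutes it into the tail bound obtained by combining (A.11) with (A.13), namely $d \exp(-t^2/(2 r(\psi) + 2t/\sqrt{\psi}))$. The function names and the sample values of $R$, $\sigma^2$, and $d$ are assumptions for illustration, and the restriction $0 < \psi < 1/(4R^2)$ in the code simply keeps the denominator positive.

```python
import math


def r_bound(psi, R, sigma2):
    """Bound 4 sigma^2 + 8 R^2 sigma^2 psi / (1 - 4 R^2 psi) on r(psi)."""
    if not 0.0 < psi < 1.0 / (4.0 * R * R):
        raise ValueError("psi must lie in (0, 1/(4 R^2))")
    return 4.0 * sigma2 + 8.0 * R * R * sigma2 * psi / (1.0 - 4.0 * R * R * psi)


def combinatorial_tail_bound(t, R, sigma2, d):
    """Tail bound d * exp(-t^2 / (2 r + 2 t / sqrt(psi))) at psi = 1 / (8 R^2)."""
    psi = 1.0 / (8.0 * R * R)
    r = r_bound(psi, R, sigma2)
    return d * math.exp(-t * t / (2.0 * r + 2.0 * t / math.sqrt(psi)))


# Hypothetical values of the uniform bound R, the variance parameter sigma^2, and d.
R, sigma2, d = 1.5, 0.7, 4
print(r_bound(1.0 / (8.0 * R * R), R, sigma2), 6.0 * sigma2)
```

With this choice of $\psi$, the denominator in the exponent becomes $12\sigma^2 + 4\sqrt{2}\, R t$.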


